# Format of TF files
A `.tf` feature file starts with a *header*, and is followed by the actual data.
The whole file is plain text, encoded in Unicode UTF-8.

## Header
The header consists of two portions: *metadata* and *comment*.

### Metadata
A `.tf` feature file always starts with one or more lines of the form

    @key

or

    @key=value

The first line must be either

    @node

or 

    @edge

This tells Text-Fabric whether the data in the feature file is a *node* feature or an *edge* feature.
The rest of the metadata is optional for now, but it is recommended to include a date stamp, like this:

    @dateCreated=2016-11-20T13:26:59Z

The time format should be [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601).
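
As an illustration of the metadata rules above, here is a minimal sketch of a header reader and a date stamp writer. The names `parse_header` and `date_stamp` are hypothetical and not part of Text-Fabric, and storing a bare `@key` with an empty value is an assumption made for this example.

```python
from datetime import datetime, timezone


def parse_header(lines):
    """Return (kind, metadata) from the leading @-lines of a .tf file.

    Hypothetical helper for illustration; not Text-Fabric's own reader.
    """
    kind = None
    metadata = {}
    for i, line in enumerate(lines):
        if not line.startswith("@"):
            break  # the first non-@ line ends the metadata
        key, sep, value = line[1:].partition("=")
        if i == 0:
            # The first line must be exactly @node or @edge.
            if sep or key not in ("node", "edge"):
                raise ValueError("first line must be @node or @edge")
            kind = key
        else:
            # Assumption: a bare @key is stored with an empty value.
            metadata[key] = value
    if kind is None:
        raise ValueError("missing @node or @edge line")
    return kind, metadata


def date_stamp(moment):
    """Format a timezone-aware datetime as an ISO 8601 @dateCreated line in UTC."""
    utc = moment.astimezone(timezone.utc)
    return "@dateCreated=" + utc.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Calling `date_stamp(datetime.now(timezone.utc))` would produce a line in the same shape as the example above.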
    
## Data
After the metadata, there must be exactly one blank line; everything after that consists of data lines.

The form of a data line is

    node_spec value

for node features, and

    node_spec node_spec value

for edge features.

These fields are separated by single tabs.

### Node Spec
Every line contains a feature value that pertains to all nodes defined by its *node_spec*, or to
all edges defined by its pair of *node_spec*s.

A node spec denotes a *set* of nodes.

The simplest form of a node spec is just a single integer. Examples:

    3
    45
    425000

Ranges are also allowed. Examples:

    1-10
    5-13
    28-57045

The nodes denoted by a range are all numbers between the endpoints of the range, with both endpoints included.
So

    2-4

denotes the nodes `2`, `3`, and `4`.

You can also combine numbers and ranges arbitrarily by separating them with commas. Example:

    1-3,5-10,15,23-37

Such a specification denotes the union of what is denoted by each comma-separated part.

**NB** As node specs denote *sets* of nodes, the following node specs are in fact equivalent:

    1,1 and 1
    2-3 and 3,2
    1-5,2-7 and 1-7

We are also tolerant about the order of range endpoints: you may specify them in either order.

    1-3 is the same as 3-1
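
The node spec rules above can be sketched as a small parser that returns a Python `set`. The function name `parse_node_spec` is hypothetical and chosen for this illustration; it is not claimed to be the function Text-Fabric uses.

```python
def parse_node_spec(spec):
    """Expand a node spec such as '1-3,5-10,15' into a set of integers.

    Hypothetical helper for illustration only.
    """
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            # Endpoints may come in either order and are both included.
            low, high = min(start, end), max(start, end)
            nodes.update(range(low, high + 1))
        else:
            nodes.add(int(part))
    return nodes
```

Because the result is a set, duplicates and overlapping ranges collapse automatically, which is exactly the equivalence described in the note above.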
    
#### Edges
An edge is specified by an *ordered* pair of nodes. The edge is *from* the first node in the pair *to* the second one.
An edge spec consists of two node specs. It denotes all edges that are *from* a node denoted by the first node spec
and *to* a node denoted by the second node spec.
An edge might be labeled; in that case the label of the edge is specified by the *value* after the two node specs.
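
In other words, an edge spec denotes the Cartesian product of its two node sets. The following is a minimal sketch of that reading; `expand_edge_spec` and `parse_node_spec` are hypothetical names introduced for this example.

```python
from itertools import product


def parse_node_spec(spec):
    """Expand a node spec such as '1-3,5' into a set of integers (illustrative)."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            nodes.update(range(min(start, end), max(start, end) + 1))
        else:
            nodes.add(int(part))
    return nodes


def expand_edge_spec(from_spec, to_spec):
    """Return every (from, to) pair denoted by two node specs."""
    return set(product(parse_node_spec(from_spec), parse_node_spec(to_spec)))
```

Each pair is ordered, so `(1, 2)` and `(2, 1)` are different edges, and a pair such as `(1, 1)` is an edge from a node to itself.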

#### Optimization: implicit node
You may leave out the node spec for node features, and the first node spec for edge features. In that case, you must also leave out the tab that would follow the node spec.
A node spec that has been left out denotes the singleton set consisting of the *implicit node*.
Here are the rules for implicit nodes.

* On a line with an explicit node spec, the implicit node is equal to the highest node
  denoted by that node spec.
* On the first line, if there is no explicit node spec, the implicit node is just `0`.
* On all other lines without an explicit node spec, the implicit node is equal to the
  implicit node of the previous line plus 1.

For edges, this optimization only happens for the *first* node spec.
The second node spec must always be explicit.

This optimizes some feature files greatly, e.g. the feature that contains the actual text of each word.

Instead of

    0 be
    1 reshit
    2 bara
    3 elohim
    4 et
    5 ha
    6 shamajim
    7 we
    8 et
    9 ha
    10 arets

you can just say

    be
    reshit
    bara
    elohim
    et
    ha
    shamajim
    we
    et
    ha
    arets
    
This optimization is optional. It is a device for reducing the size of
data files that you want to distribute.
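
The implicit node rules can be sketched as a single pass that remembers the implicit node of the previous line. In this sketch, each data line is assumed to be already split into a pair `(spec, value)`, where `spec` is `None` when the node spec was left out. The names `resolve_implicit_nodes` and `parse_node_spec` are hypothetical.

```python
def parse_node_spec(spec):
    """Expand a node spec such as '1-3,5' into a set of integers (illustrative)."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            nodes.update(range(min(start, end), max(start, end) + 1))
        else:
            nodes.add(int(part))
    return nodes


def resolve_implicit_nodes(entries):
    """Turn (spec or None, value) pairs into (set of nodes, value) pairs."""
    resolved = []
    implicit = -1  # so that a first line without a spec yields node 0
    for spec, value in entries:
        if spec is None:
            implicit += 1  # previous implicit node plus 1
            nodes = {implicit}
        else:
            nodes = parse_node_spec(spec)
            implicit = max(nodes)  # highest node of the explicit spec
        resolved.append((nodes, value))
    return resolved
```

For edge features the same bookkeeping would apply to the first node spec only, since the second one must always be explicit.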

### Value

The value is arbitrary text.
We do not distinguish types: all values are taken as Unicode strings, encoded as UTF-8 in the file.
There are a few escapes:

* `\\` backslash
* `\t` tab
* `\n` newline

These characters MUST always be escaped in a value string, otherwise the line as a whole might be ambiguous.
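
A minimal sketch of the three escapes, in both directions, might look as follows. The function names `escape` and `unescape` are hypothetical. Unescaping is done in one left-to-right pass so that an escaped backslash followed by `t` is not mistaken for an escaped tab.

```python
import re

# Maps each two-character escape sequence to the character it stands for.
_UNESCAPE = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}


def unescape(value):
    """Replace the escapes backslash-backslash, backslash-t, backslash-n."""
    return re.sub(r"\\[\\tn]", lambda m: _UNESCAPE[m.group(0)], value)


def escape(value):
    """Escape backslash, tab and newline; backslash must be handled first."""
    return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
```

In `escape`, the backslash is replaced first; otherwise the backslashes introduced for tabs and newlines would themselves be doubled.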

#### Optimization: empty value

If the value is the empty string, you may also leave out the preceding tab (if there is one).
This is especially good for edge features, because most edges just consist of a node pair without any value.

This optimization causes an ambiguity when only one field is present, or when an edge feature line has only two fields: either the (first) node spec has been left out, or the value has been left out.
In those cases we assume that the node spec has been left out for node features, and that the value has been
left out for edge features.

So, in a node feature a line like this

    42

means that the implicit node gets value `42`, and not that node `42` gets the empty value.

Likewise, an edge feature line like this

    42 43
 
means that there is an edge from `42` to `43` with empty value, and not that there is an edge from the implicit node
to `42` with value `43`.

And an edge feature line like this

    42

means that there is an edge from the implicit node to `42` with the empty value, and not that there is an
edge from the implicit node to itself with the value `42`.
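
The disambiguation rules above can be sketched as a function that splits a raw data line on tabs and decides which fields are missing. The name `split_data_line` is hypothetical. The sketch applies only the rules stated in this section: it does not validate node specs, and it returns `None` where a node spec was left out.

```python
def split_data_line(line, is_edge):
    """Split a data line into its fields, filling in what was left out.

    Node features return (spec, value); edge features return
    (from_spec, to_spec, value). A missing spec is returned as None.
    """
    fields = line.split("\t")
    if is_edge:
        if len(fields) == 1:
            # Only the second node spec is present: no first spec, no value.
            return None, fields[0], ""
        if len(fields) == 2:
            # Two fields in an edge feature: the value has been left out.
            return fields[0], fields[1], ""
        if len(fields) == 3:
            return fields[0], fields[1], fields[2]
    else:
        if len(fields) == 1:
            # One field in a node feature: the node spec has been left out.
            return None, fields[0]
        if len(fields) == 2:
            return fields[0], fields[1]
    raise ValueError("too many fields on data line")
```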

The reason for these conventions is practical: edge features usually have empty labels, and there are many edges.
In the case of the ETCBC database, there are 1.5 million edges, so every extra character that is needed on a data line
increases the file size by 1.5 MB.

Nodes, on the other hand, usually do not have empty values, and they are often specified consecutively,
especially word nodes. There are quite a few distinct word features, and it would be a waste to have a column of half a million incremental integers in those files.

### Dynamics

If a node is denoted by node specs on several lines, the value on the last such line determines what gets
assigned to that node.
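
With a dictionary as the target, this "last line wins" behaviour falls out of plain assignment. A minimal sketch, assuming the lines have already been resolved into pairs of a node set and a value (the name `assign_values` is hypothetical):

```python
def assign_values(resolved_lines):
    """Build a node -> value mapping; later lines overwrite earlier ones."""
    data = {}
    for nodes, value in resolved_lines:
        for node in nodes:
            data[node] = value  # overwrites any value from an earlier line
    return data
```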

## Examples

Here are a few more or less contrived examples of legal feature data lines.

### Node features

1. `\t\n`
1. `1 2\t3`
1. `foo\nbar`
1. `1 Escape \\t as \\\\t`

meaning

1. node 0 has value tab-newline
1. node 1 has value 2 tab 3
1. node 2 has value foo newline bar
1. node 1 gets a new value: the string `Escape \t as \\t`
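
The four node feature lines above can be checked with a small end-to-end sketch that combines field splitting, implicit nodes, unescaping and last-line-wins assignment. The function `read_node_feature` and its helpers are hypothetical names for this illustration. The sketch assumes that the first field is separated from the value by a real tab character, while `\t` inside a value is the two-character escape.

```python
import re

_UNESCAPE = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}


def unescape(value):
    """Replace the three escape sequences in one left-to-right pass."""
    return re.sub(r"\\[\\tn]", lambda m: _UNESCAPE[m.group(0)], value)


def parse_node_spec(spec):
    """Expand a node spec such as '1-3,5' into a set of integers."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            nodes.update(range(min(start, end), max(start, end) + 1))
        else:
            nodes.add(int(part))
    return nodes


def read_node_feature(data_lines):
    """Return a node -> value mapping for the data lines of a node feature."""
    data = {}
    implicit = -1  # a first line without a node spec then gets node 0
    for line in data_lines:
        fields = line.split("\t")
        if len(fields) == 1:
            implicit += 1
            nodes, raw = {implicit}, fields[0]
        elif len(fields) == 2:
            nodes, raw = parse_node_spec(fields[0]), fields[1]
            implicit = max(nodes)
        else:
            raise ValueError("too many fields on data line")
        for node in nodes:
            data[node] = unescape(raw)  # later lines overwrite earlier ones
    return data


# The four example lines; "\t" outside a raw string is a real tab separator.
example_lines = [
    r"\t\n",
    "1\t" + r"2\t3",
    r"foo\nbar",
    "1\t" + r"Escape \\t as \\\\t",
]
result = read_node_feature(example_lines)
```

After this runs, `result` holds three nodes: node 0 with a tab followed by a newline, node 1 with the string `Escape \t as \\t`, and node 2 with `foo`, a newline, and `bar`.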

### Edge features

1. `1 2\tfoo`
1. `1 2 foo`
1. `0-1 1-2 bar`

meaning

1. edge from 0 to 1 with value 2 tab foo
1. edge from 1 to 2 with value foo
1. four edges: 0->1, 0->2, 1->1, 1->2, all with value bar.
   Note that edges can go from a node to itself.
   Note also that two edges get new values here: 0->1 and 1->2.





In [1]:
%load_ext autoreload
%autoreload 2

In [61]:
import sys
sys.path.append('..')
from tf.fabric import Fabric
from tf.feature import Feature
from tf.timestamp import Timestamp

In [3]:
from laf.fabric import LafFabric
from etcbc.preprocess import prep
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [4]:
fabric.load('etcbc4c', '--', 'TF', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads g_word_utf8 trailer_utf8
        sp
    ''','''
        mother
        functional_parent
    '''),
    "prepare": prep(select={'L'}),
    "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4c: UP TO DATE
  0.00s USING main: etcbc4c DATA COMPILED AT: 2016-11-09T19-16-37
  0.01s DETAIL: load main: G.node_anchor_min
  0.10s DETAIL: load main: G.node_anchor_max
  0.20s DETAIL: load main: G.node_sort
  0.25s DETAIL: load main: G.node_sort_inv
  0.64s DETAIL: load main: G.edges_from
  0.69s DETAIL: load main: G.edges_to
  0.76s DETAIL: load main: F.etcbc4_db_monads [node] 
  1.41s DETAIL: load main: F.etcbc4_db_otype [node] 
  2.00s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] 
  2.50s DETAIL: load main: F.etcbc4_ft_sp [node] 
  2.90s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node] 
  3.08s DETAIL: load main: F.etcbc4_ft_functional_parent [e] 
  3.31s DETAIL: load main: F.etcbc4_ft_mother [e] 
  3.37s DETAIL: load main: C.etcbc4_ft_functional_parent -> 
  4.08s DETAIL: load main: C.etcbc4_ft_mother -> 
  4.21s DETAIL: load main: C.etcbc4_ft_functional_parent <- 
  4.60s DETAIL: load main: C.etcbc4_

In [5]:
sp = {}
for n in F.otype.s('word'):
    sp[n] = F.sp.v(n)

In [6]:
tm = Timestamp()
info = lambda msg: tm.info(msg)
error = lambda msg: tm.info(msg)

testDir = '../tests'

In [7]:
TF = {}
features = ('otype', 'monads')
for feat in features:
    tm.reset()
    info('Reading feature "{}"\n'.format(feat))
    TF[feat] = Feature('{}/{}.tf'.format(testDir, feat))
    good = TF[feat].load(asName=feat)

    if not good:
        print('There were errors')
    else:
        label = 'edge' if TF[feat].isEdge else 'node'
        print('{} feature\n{}\nData: {} {}s'.format(
            label,
            '\n'.join('{:<20}: {}'.format(*x) for x in sorted(TF[feat].metaData.items())),
            len(TF[feat].data),
            label,
        ))
    info('Done\n')

  0.00s Reading feature "otype"
node feature
createdBy           : Text-Fabric
dateCreated         : 2016-11-20T11:05:06+07
Data: 1436894 nodes
  0.74s Done
  0.00s Reading feature "monads"
edge feature
createdBy           : Text-Fabric
dateCreated         : 2016-11-20T11:05:06+07
Data: 1010313 edges
  1.98s Done


In [8]:
TF['monads'].data[426581]

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

In [15]:
TF['monads'].dataLoaded = False

In [16]:
TF['monads'].load()

True

In [11]:
for x in (426580, 426581):
    print('{:>6} is a {}'.format(x, TF['otype'].data[x]))

426580 is a word
426581 is a clause


In [12]:
tm.reset()
info('Re-reading metadata only\n')
TF['otype'].readTf(metaOnly=True)
info('Done\n')

  0.00s Re-reading metadata only
  0.00s Done


In [13]:
tm.reset()
info('writing otype to "myOtype"\n')
TF['otype'].writeTf(fileName='myOtype')
info('Done\n')

  0.00s writing otype to "myOtype"
  2.30s Done


In [14]:
TF['sp'] = Feature(
    '{}/{}.tf'.format(testDir, 'sp'), 
    asName='sp', data=sp, isEdge=False, edgeValues=False,
    metaData=dict(
        convertedFrom='LAF-Fabric',
    ),
)
tm.reset()
info('writing new feature "{}"\n'.format('psp'))
TF['sp'].writeTf(fileName='psp')
info('Done\n')

  0.00s writing new feature "psp"
  0.74s Done


In [15]:
TF[feat].writeTf()
info('writing "{}" feature\n'.format('monads'))
TF[feat].writeTf(fileName='monads_meta', metaOnly=True)
info('Done\n')

Feature file "../tests/monads.tf" already exists, feature will not be written


  3.08s writing "monads" feature
  3.08s Done


In [16]:
tm.reset()
info('writing binary "{}" feature\n'.format('monads'))
TF['monads'].writeDataBin()
info('Done {}\n'.format(TF['monads'].dataName))

  0.00s writing binary "monads" feature
  1.31s Done None


In [17]:
tm.reset()
info('reading binary "{}" feature\n'.format('monads'))
TF['monads'].readDataBin()
info('Done {}\n'.format(TF['monads'].dataName))

  0.00s reading binary "monads" feature
  2.52s Done None


In [47]:
from tf.feature import averageMonadLength
averageMonadLength(TF['otype'].data, TF['monads'].data, 'word')

[('book', 10937.97435897436),
 ('chapter', 459.1829924650161),
 ('verse', 18.376814715891957),
 ('half_verse', 9.441810535635236),
 ('sentence', 6.710413717162184),
 ('sentence_atom', 6.6302087380904275),
 ('clause', 4.847511363636364),
 ('clause_atom', 4.71037521256156),
 ('phrase', 1.6849321020325942),
 ('phrase_atom', 1.5946059099489749),
 ('subphrase', 1.4241071428571428),
 ('word', 1.0)]

In [17]:
TF['monads'].load()

data already loaded and up to date


True

In [77]:
API = Fabric(locations='../tests')

  0.00s Looking for available data features:
  0.00s   monads               from /Users/dirk/github/text-fabric/tests/monads.tf
  0.00s   monads_meta          from /Users/dirk/github/text-fabric/tests/monads_meta.tf
  0.00s   myOtype              from /Users/dirk/github/text-fabric/tests/myOtype.tf
  0.00s X otype                from /Users/dirk/github/text-fabric/notebooks/otype.tf
  0.01s   otype                from /Users/dirk/github/text-fabric/tests/otype.tf
  0.01s   psp                  from /Users/dirk/github/text-fabric/tests/psp.tf
  0.01s   sp                   from /Users/dirk/github/text-fabric/tests/sp.tf
  0.01s 6 features found

In [79]:
API.load('''
    otype
    monads
''')

  0.29s otype                loaded from /Users/dirk/github/text-fabric/tests/otype.tf
  2.14s monads               loaded from /Users/dirk/github/text-fabric/tests/monads.tf
  2.43s All features loaded


True