# Text-Fabric

This is a testing ground for algorithms to be used in Text-Fabric.

## Range manipulation

### Convert iterable to ranges
Convert an iterable of numbers to a comma separated, optimal list of ranges.
Optimal means:

1. the ranges are sorted
2. adjacent ranges do not share a boundary, i.e. there is a real gap
   between adjacent ranges

**NB:** Unlike in Python, ranges include their boundary values.

The parameter `nlist` can be any iterable (arrays, lists,
dictionary keys, sets), but the items must be numbers, not strings.
The iterable will be sorted and stripped of duplicates.

We also define the converse: from a list of number ranges to an iterable, in this case a list.
The `ranges` parameter is a list of tuples `(start, end)`, which are the boundaries of a range.
The boundaries are inclusive. Instead of a tuple, a single integer may be provided.
If `start` is bigger than `end`, the two will be swapped around.
The result will be a sorted list without duplicates.

In [6]:
import collections

In [40]:
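def convert_to_ranges(nlist):
    ranges = []
    # boundaries of the range under construction; None until the first number is seen
    curstart = None
    curend = None
    for n in sorted(set(nlist)):
        if curstart is None:
            curstart = n
            curend = n
        elif n == curend + 1:
            # n extends the current range
            curend = n
        else:
            # gap found: close the current range and start a new one at n
            ranges.append((curstart, curend))
            curstart = n
            curend = n
    # close the last range, unless the input was empty
    if curstart is not None:
        ranges.append((curstart, curend))
    return ranges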

In [41]:
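def ranges_to_list(ranges):
    covered = set()
    for r in ranges:
        if isinstance(r, tuple):
            (start, end) = r
            # boundaries given in the wrong order are swapped
            if start > end:
                (start, end) = (end, start)
            # end + 1 because the boundaries are inclusive
            covered.update(range(start, end + 1))
        else:
            # a single number stands for itself
            covered.add(r)
    return sorted(covered)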

In [42]:
tests_r2l = dict(
    a_empty = [],
    b_simple = [(5,10)],
    c_swapped = [(10,5)],
    d_single = [1,2,3,4,5],
    e_mixed = [1,2, (10,15), 16],
    f_duplicate = [1, (1,5), (3,7), 7],
    g_order = [7, (1,5), (7,3), 1],
)

In [43]:
for (t, data) in sorted(tests_r2l.items()):
    print('[{:<12}] {}'.format(t, ranges_to_list(data)))

[a_empty     ] []
[b_simple    ] [5, 6, 7, 8, 9, 10]
[c_swapped   ] [5, 6, 7, 8, 9, 10]
[d_single    ] [1, 2, 3, 4, 5]
[e_mixed     ] [1, 2, 10, 11, 12, 13, 14, 15, 16]
[f_duplicate ] [1, 2, 3, 4, 5, 6, 7]
[g_order     ] [1, 2, 3, 4, 5, 6, 7]


In [44]:
tests_l2r = dict(
    a_empty = [],
    b_consec = range(5,10),
    c_mixed = ranges_to_list([(1,4), (5,8), (10,12), (100, 110)]),
    d_mixed_up = [1,2, 5,6,7, 2,3, 3,4, 12,11,10],
)

In [45]:
for (t, data) in sorted(tests_l2r.items()):
    print('[{:<12}] {}'.format(t, convert_to_ranges(data)))

[a_empty     ] []
[b_consec    ] [(5, 9)]
[c_mixed     ] [(1, 8), (10, 12), (100, 110)]
[d_mixed_up  ] [(1, 7), (10, 12)]
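
`convert_to_ranges` returns a list of tuples. For the comma separated text form mentioned at the start of this section, a helper along the following lines could be used. This is only a sketch: the names `format_ranges` and `parse_ranges` and the `start-end` notation are assumptions, not part of the code above, and the parser assumes non-negative integers.

```python
def format_ranges(ranges):
    """Render inclusive (start, end) tuples as a comma separated string."""
    parts = []
    for (start, end) in ranges:
        # a range that covers one number is written as that number alone
        if start == end:
            parts.append(str(start))
        else:
            parts.append('{}-{}'.format(start, end))
    return ','.join(parts)


def parse_ranges(text):
    """Inverse of format_ranges (assumes non-negative integers only)."""
    ranges = []
    for part in text.split(','):
        if not part:
            # the empty string stands for the empty list of ranges
            continue
        if '-' in part:
            (start, end) = part.split('-')
            ranges.append((int(start), int(end)))
        else:
            ranges.append((int(part), int(part)))
    return ranges


print(format_ranges([(1, 7), (10, 12), (20, 20)]))
```

The result of `parse_ranges` has the same shape as the output of `convert_to_ranges`, so it can be passed on to `ranges_to_list`.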


# Exporting otype
We want to export the `otype` feature in compact form, like this:

    0-425000 word
    500-700000 phrase

and so on for each object type.

We do the exercise for the brand-new data version ETCBC4c.

In [46]:
from laf.fabric import LafFabric
#from etcbc.preprocess import prepare
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [57]:
fabric.load('etcbc4c', '--', 'TF', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads g_word_utf8 trailer_utf8
    ''',''),
#    "prepare": prepare,
    "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main: etcbc4c DATA COMPILED AT: 2016-11-09T19-16-37
  1.06s INFO: DATA LOADED FROM SOURCE etcbc4c AND ANNOX  FOR TASK TF AT 2016-11-14T10-29-06


In [48]:
otypes = '''
    word
    subphrase
    phrase_atom
    phrase
    clause_atom
    clause
    sentence_atom
    sentence
    half_verse
    verse
    chapter
    book
'''.strip().split()

In [55]:
ranges = {}
for otype in otypes:
    these_ranges = convert_to_ranges(F.otype.s(otype))
    inf('{:<15} {:>5} ranges'.format(
        otype, len(these_ranges),
    ), withtime=True)
    ranges[otype] = these_ranges
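# the pattern ((start, end),) assumes exactly one range per object type;
# the object types are sorted by the start of that range
for (otype, ((start, end),)) in sorted(ranges.items(), key=lambda x: x[1][0][0]):
    print('{:<15} from {:>7} to {:>7}'.format(otype, start, end))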

11m 07s word                1 ranges
11m 08s subphrase           1 ranges
11m 09s phrase_atom         1 ranges
11m 10s phrase              1 ranges
11m 11s clause_atom         1 ranges
11m 12s clause              1 ranges
11m 13s sentence_atom       1 ranges
11m 14s sentence            1 ranges
11m 15s half_verse          1 ranges
11m 17s verse               1 ranges
11m 18s chapter             1 ranges
11m 19s book                1 ranges
word            from       0 to  426580
clause          from  426581 to  514580
clause_atom     from  514581 to  605142
phrase          from  605143 to  858316
phrase_atom     from  858317 to 1125831
sentence        from 1125832 to 1189401
sentence_atom   from 1189402 to 1253740
subphrase       from 1253741 to 1367532
book            from 1367533 to 1367571
chapter         from 1367572 to 1368500
half_verse      from 1368501 to 1413680
verse           from 1413681 to 1436893
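
With one range per object type, the compact `start-end otype` form shown at the top of this section follows directly from the `ranges` dictionary. The sketch below illustrates that step. The function name `otype_lines` is an assumption, `sample_ranges` holds only two entries copied from the output above, and every object type is assumed to have at least one range.

```python
# same shape as the `ranges` dictionary built in the loop above;
# the two entries are copied from the printed output, the rest is left out
sample_ranges = {
    'word': [(0, 426580)],
    'clause': [(426581, 514580)],
}


def otype_lines(ranges):
    """Return one 'start-end otype' line per range, ordered by first start."""
    lines = []
    # assumes every object type has at least one range
    for (otype, these_ranges) in sorted(ranges.items(), key=lambda x: x[1][0][0]):
        for (start, end) in these_ranges:
            lines.append('{}-{} {}'.format(start, end, otype))
    return lines


print('\n'.join(otype_lines(sample_ranges)))
```

Unlike the printing loop above, this version does not require exactly one range per object type: a type with several ranges simply yields several lines.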


# Exporting a TF set

We will export ETCBC4c as a Text-Fabric dataset.

That means: a *backbone* consisting of the `otype` feature and the `extent` edge.

On top of that: the feature files. At first, we just pick two: `g_word_utf8` and `trailer_utf8`.
Together they form exactly the plain text of the Hebrew Bible in ETCBC4c.

The `otype` feature maps every node id onto its object type (otype).
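
Because every object type occupies a contiguous block of node ids, such a mapping does not have to be stored per node. One possible lookup, sketched here with the standard `bisect` module, searches the sorted range starts. The names `otype_ranges` and `otype_of` are assumptions, and only the first three ranges from the output above are included.

```python
import bisect

# (start, end, otype) triples sorted by start;
# values copied from the output above, remaining object types left out
otype_ranges = [
    (0, 426580, 'word'),
    (426581, 514580, 'clause'),
    (514581, 605142, 'clause_atom'),
]
starts = [start for (start, end, otype) in otype_ranges]


def otype_of(node):
    """Return the object type of a node id, or None if no range covers it."""
    # index of the last range whose start is <= node
    i = bisect.bisect_right(starts, node) - 1
    if i < 0:
        return None
    (start, end, otype) = otype_ranges[i]
    return otype if node <= end else None


print(otype_of(426581))
```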

The `extent` edge feature contains unlabeled edges between each object and every *monad* belonging to it.
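
As an illustration of what "an edge between each object and every monad belonging to it" amounts to, the sketch below expands inclusive monad ranges into individual edges. The function name `extent_edges`, the input shape and the node ids are all made up for the example; they are not taken from the ETCBC data or from the LAF-Fabric API.

```python
def extent_edges(monad_ranges):
    """Yield an (object, monad) pair for every monad in every range of every object."""
    for (obj, ranges) in sorted(monad_ranges.items()):
        for (start, end) in ranges:
            # end + 1 because range boundaries are inclusive
            for monad in range(start, end + 1):
                yield (obj, monad)


# hypothetical objects 1000 and 1001; the second one has a gap at monad 6
sample = {1000: [(1, 3)], 1001: [(4, 5), (7, 7)]}
edges = list(extent_edges(sample))
print(edges)
```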

In [None]:
def tf_otype():