In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

from tf.convert.tf import explode

# Exploding

Exploding rewrites a TF feature file as a more straightforward data file.

The result is still a text file, but each node-value pair occupies a single line,
in which node and value are stated explicitly.

For edge features, each line holds a node pair and, if the feature has values, the feature value,
with tabs separating the first node, the second node, and the value.

All metadata is lost.
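
To illustrate how little is needed to consume the exploded format, here is a minimal sketch that reads node-value lines into a dictionary. The function name `read_exploded_node_feature` is made up for this example and is not part of Text-Fabric. The sketch assumes one `node<TAB>value` pair per line and does not interpret any escaping that values might carry.

```python
import io


def read_exploded_node_feature(handle):
    """Return {node: value} from lines of the form 'node<TAB>value'.

    Hypothetical helper, not a Text-Fabric function.
    """
    data = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the rest of the line is the value.
        node, value = line.split("\t", 1)
        data[int(node)] = value
    return data


# A few lines in the exploded layout, held in memory instead of on disk.
sample = io.StringIO("100\tGenesis\n101\tExodus\n200\tJosua\n")
books = read_exploded_node_feature(sample)
print(books[101])
```

The same function works on a real file by passing an open file handle instead of the `StringIO` object.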

# Rationale

When it comes to (big) data processing, the Text-Fabric library might easily get in the way.
By converting the typical, optimized Text-Fabric feature files to explicit tab-separated files
with the same information content, the data can flow smoothly into other systems.

We explode a bunch of small test features in one go.

In [3]:
explode('explode/in', 'explode/out')

explode/in/otext.tf => explode/out/otext.tf:
	! This is a config feature. It has no data.


True

## Config features

In [4]:
!cat explode/in/otext.tf

@config
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse


A config feature has no data, so there is no result file.

In [5]:
!cat explode/out/otext.tf

cat: explode/out/otext.tf: No such file or directory


## Node features

### book

In [6]:
!cat explode/in/book.tf

@node
@valueType=str

100	Genesis
Exodus
Leviticus
Numeri
Deuteronomium
200	Josua
Judices
Samuel_I
Samuel_II
Reges_I
Reges_II


Watch how the gaps behave: a line without a node number gets the node that follows the one on the previous line.

In [7]:
!cat explode/out/book.tf

100	Genesis
101	Exodus
102	Leviticus
103	Numeri
104	Deuteronomium
200	Josua
201	Judices
202	Samuel_I
203	Samuel_II
204	Reges_I
205	Reges_II
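
The numbering seen above can be sketched in a few lines of Python. This is an illustration of the rule as it shows up in the output, not the code that `explode` runs; the function name `expand_implicit_nodes` is hypothetical, and the sketch only handles single node numbers, not ranges.

```python
def expand_implicit_nodes(lines):
    """Give every value an explicit node number.

    A line 'node<TAB>value' states its node; a line holding only 'value'
    gets the node after the one used on the previous line.
    Hypothetical helper, not a Text-Fabric function.
    """
    result = []
    previous = 0
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            node, value = int(fields[0]), fields[1]
        else:
            node, value = previous + 1, fields[0]
        result.append((node, value))
        previous = node
    return result


book_in = ["100\tGenesis", "Exodus", "Leviticus", "200\tJosua", "Judices"]
book_out = expand_implicit_nodes(book_in)
for node, value in book_out:
    print(f"{node}\t{value}")
```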


### otype

Now a node feature with ranges:

In [8]:
!cat explode/in/otype.tf

@node
@valueType=str

1-3,5	word
6-8	phrase
9,10	clause
11	sentence


In [9]:
!cat explode/out/otype.tf

1	word
2	word
3	word
5	word
6	phrase
7	phrase
8	phrase
9	clause
10	clause
11	sentence
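
The range expansion shown above can be sketched as follows. The helper `parse_node_spec` is a made-up name for this example; it assumes a specification is a comma-separated list of single nodes and `start-end` ranges with both ends included, which is what the sample data suggests.

```python
def parse_node_spec(spec):
    """Expand a spec such as '1-3,5' into [1, 2, 3, 5].

    Hypothetical helper, not a Text-Fabric function.
    """
    nodes = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            nodes.extend(range(int(start), int(end) + 1))  # end is inclusive
        else:
            nodes.append(int(part))
    return nodes


otype_in = ["1-3,5\tword", "6-8\tphrase", "9,10\tclause", "11\tsentence"]
otype_out = []
for line in otype_in:
    spec, value = line.split("\t")
    for node in parse_node_spec(spec):
        otype_out.append((node, value))

for node, value in otype_out:
    print(f"{node}\t{value}")
```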


## Edge features

### oslots

An edge feature without values.

In [10]:
!cat explode/in/oslots.tf

@edge
@valueType=int

1	5-10
5-7,8
10	12,14
13,15


In [11]:
!cat explode/out/oslots.tf

1	5
1	6
1	7
1	8
1	9
1	10
2	5
2	6
2	7
2	8
10	12
10	14
11	13
11	15
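
For edges, each line expands into all combinations of its source nodes and target nodes. The sketch below reproduces the `oslots` output above under that reading. The names `parse_node_spec` and `expand_edges` are hypothetical, and the code assumes a valueless edge feature in which a line with a single field has an implicit source node, namely the node after the previous source.

```python
def parse_node_spec(spec):
    """Expand a spec such as '5-7,8' into [5, 6, 7, 8] (hypothetical helper)."""
    nodes = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            nodes.extend(range(int(start), int(end) + 1))
        else:
            nodes.append(int(part))
    return nodes


def expand_edges(lines):
    """Return (source, target) pairs for an edge feature without values."""
    pairs = []
    previous = 0
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            sources = parse_node_spec(fields[0])
            targets = parse_node_spec(fields[1])
        else:
            # Only targets are given: the source is implicit.
            sources = [previous + 1]
            targets = parse_node_spec(fields[0])
        for source in sources:
            for target in targets:
                pairs.append((source, target))
        previous = sources[-1]
    return pairs


oslots_in = ["1\t5-10", "5-7,8", "10\t12,14", "13,15"]
oslots_pairs = expand_edges(oslots_in)
for source, target in oslots_pairs:
    print(f"{source}\t{target}")
```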


### crossref

An edge feature with integer values.

In [12]:
!cat explode/in/crossref.tf

@edge
@edgeValues
@valueType=int

1	5-10	10
5-7,8	20
10	12,14	75
13,15	80


In [13]:
!cat explode/out/crossref.tf

1	5	10
1	6	10
1	7	10
1	8	10
1	9	10
1	10	10
2	5	20
2	6	20
2	7	20
2	8	20
10	12	75
10	14	75
11	13	80
11	15	80
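
Once exploded, an edge feature with values is a plain three-column table, so the standard `csv` module can read it. The sketch below loads such lines into a nested dictionary. The function name `read_exploded_edges` is made up for this example, and the code assumes every line has exactly three tab-separated fields, as in the output above.

```python
import csv
import io
from collections import defaultdict


def read_exploded_edges(handle, convert=int):
    """Return {source: {target: value}} from 'source<TAB>target<TAB>value' lines.

    Hypothetical helper, not a Text-Fabric function. `convert` turns the
    value field into the desired type (int here, str for string features).
    """
    edges = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        source, target, value = row
        edges[int(source)][int(target)] = convert(value)
    return dict(edges)


sample = io.StringIO("1\t5\t10\n1\t6\t10\n2\t5\t20\n10\t12\t75\n")
crossref = read_exploded_edges(sample)
print(crossref[1][6])
```

Passing `convert=str` would cover a feature with string values in the same way.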


### mother

An edge feature with string values.

In [14]:
!cat explode/in/mother.tf

@edge
@edgeValues
@valueType=str

1	5-10	apo
5-7,8	epi
10	12,14	dia
13,15	pros


In [15]:
!cat explode/out/mother.tf

1	5	apo
1	6	apo
1	7	apo
1	8	apo
1	9	apo
1	10	apo
2	5	epi
2	6	epi
2	7	epi
2	8	epi
10	12	dia
10	14	dia
11	13	pros
11	15	pros


## Bigger features from the BHSA

In [16]:
bhsaPath = os.path.expanduser('~/github/etcbc/bhsa')
paraPath = os.path.expanduser('~/github/etcbc/parallels')
bhsaTfIn = f'{bhsaPath}/tf/c'
paraTfIn = f'{paraPath}/tf/c'
bhsaTfOut = f'{bhsaPath}/_temp/tf/c'

In [17]:
def example(feat):
    tfIn = paraTfIn if feat == "crossref" else bhsaTfIn
    explode(f"{tfIn}/{feat}.tf", f"{bhsaTfOut}/{feat}.tf")

In [18]:
example('book')

In [19]:
example('g_word_utf8')

In [20]:
example('otype')

In [21]:
example('otext')

~/github/etcbc/bhsa/tf/c/otext.tf => ~/github/etcbc/bhsa/_temp/tf/c/otext.tf:
	! This is a config feature. It has no data.


In [22]:
example('oslots')

In [23]:
example('mother')

In [24]:
example('crossref')

In [25]:
explode(bhsaTfIn, bhsaTfOut)

~/github/etcbc/bhsa/tf/c/otext.tf => ~/github/etcbc/bhsa/_temp/tf/c/otext.tf:
	! This is a config feature. It has no data.


True

### Size

The compact TF files occupy 155 MB on disk, while the exploded ones take 453 MB.
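
Sizes like these can also be checked from Python. The sketch below sums the byte sizes of the files directly inside a directory; `dir_size` is a hypothetical helper name, and the demonstration runs on a temporary directory rather than on the BHSA paths. Note that summed file sizes can differ somewhat from the disk usage that tools such as `ls` or `du` report, because those count allocated blocks.

```python
import os
import tempfile


def dir_size(path):
    """Sum the byte sizes of the regular files directly inside path."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(path)
        if entry.is_file()
    )


# Demonstration on a throwaway directory holding one tiny exploded file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "book.tf"), "wb") as fh:
        fh.write(b"100\tGenesis\n")
    demo_size = dir_size(tmp)

print(demo_size)
```

Calling `dir_size` on the input and output directories used above and dividing the two results would give the growth factor caused by exploding.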

In [29]:
!ls -l ~/github/etcbc/bhsa/tf/c

total 302488
-rw-r--r--  1 dirk  staff    199847 Oct  8  2018 book.tf
-rw-r--r--  1 dirk  staff      1573 Oct  8  2018 book@am.tf
-rw-r--r--  1 dirk  staff       894 Oct  8  2018 book@ar.tf
-rw-r--r--  1 dirk  staff      1366 Oct  8  2018 book@bn.tf
-rw-r--r--  1 dirk  staff       768 Oct  8  2018 book@da.tf
-rw-r--r--  1 dirk  staff       747 Oct  8  2018 book@de.tf
-rw-r--r--  1 dirk  staff      1039 Oct  8  2018 book@el.tf
-rw-r--r--  1 dirk  staff       760 Oct  8  2018 book@en.tf
-rw-r--r--  1 dirk  staff       770 Oct  8  2018 book@es.tf
-rw-r--r--  1 dirk  staff       970 Oct  8  2018 book@fa.tf
-rw-r--r--  1 dirk  staff       778 Oct  8  2018 book@fr.tf
-rw-r--r--  1 dirk  staff       894 Oct  8  2018 book@he.tf
-rw-r--r--  1 dirk  staff      1255 Oct  8  2018 book@hi.tf
-rw-r--r--  1 dirk  staff       760 Oct  8  2018 book@id.tf
-rw-r--r--  1 dirk  staff       969 Oct  8  2018 book@ja.tf
-rw-r--r--  1 dirk  staff       832 Oct  8  2018 book@ko.tf
-rw-r--r--  1 dirk  staff     

In [27]:
!ls -l ~/github/etcbc/bhsa/_temp/tf/c

total 916000
-rw-r--r--  1 dirk  staff    391946 Jul  9 16:03 book.tf
-rw-r--r--  1 dirk  staff      1403 Jul  9 16:03 book@am.tf
-rw-r--r--  1 dirk  staff       717 Jul  9 16:03 book@ar.tf
-rw-r--r--  1 dirk  staff      1193 Jul  9 16:03 book@bn.tf
-rw-r--r--  1 dirk  staff       606 Jul  9 16:03 book@da.tf
-rw-r--r--  1 dirk  staff       583 Jul  9 16:03 book@de.tf
-rw-r--r--  1 dirk  staff       867 Jul  9 16:03 book@el.tf
-rw-r--r--  1 dirk  staff       595 Jul  9 16:03 book@en.tf
-rw-r--r--  1 dirk  staff       604 Jul  9 16:03 book@es.tf
-rw-r--r--  1 dirk  staff       804 Jul  9 16:03 book@fa.tf
-rw-r--r--  1 dirk  staff       612 Jul  9 16:03 book@fr.tf
-rw-r--r--  1 dirk  staff       727 Jul  9 16:03 book@he.tf
-rw-r--r--  1 dirk  staff      1081 Jul  9 16:03 book@hi.tf
-rw-r--r--  1 dirk  staff       583 Jul  9 16:03 book@id.tf
-rw-r--r--  1 dirk  staff       801 Jul  9 16:03 book@ja.tf
-rw-r--r--  1 dirk  staff       666 Jul  9 16:03 book@ko.tf
-rw-r--r--  1 dirk  staff     