In [1]:
%load_ext autoreload
%autoreload 2

# Extract

We extract volumes, i.e. groups of books, from a big dataset and save each one as a separate, smaller dataset.

In [2]:
import os
from tf.fabric import Fabric
from tf.compose import extract, combine
from tf.core.helpers import unexpanduser

In [3]:
GH = os.path.expanduser("~/github")
BH = f"{GH}/etcbc/bhsa"
VERSION = "2021"
SOURCE = f"{BH}/tf/{VERSION}"
TARGET = f"{BH}/_local/tf/{VERSION}"

# Loading

We load the dataset, and pass its api to the `extract()` function.

If something goes wrong during the extraction, we can inspect the dataset without reloading it.

In a normal scenario, we can just leave out this step. The `extract()` function will
automatically load the dataset if no `api` argument is passed.

In [4]:
TF = Fabric(locations=SOURCE)
api = TF.loadAll()
api.makeAvailableIn(globals())

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

114 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.56s All features loaded/computed - for details use TF.loadLog()
   |     0.00s Feature overview: 109 for nodes; 4 for edges; 1 configs; 8 computed
  0.00s loading features ...
  6.05s All additional features loaded - for details use TF.loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

In [9]:
VOLUMES = (
    ("Obadiah", "Nahum", "Haggai", "Habakkuk", "Jonah", "Micah"),
    ("Malachi", "Joel"),
    ("Ezra",),
)

In [11]:
volumes = extract(SOURCE, TARGET, volumes=VOLUMES, api=api, overwrite=True)

Splitting dataset in 39 books:
   |   book Genesis             : with    28763 slots
   |   book Exodus              : with    23748 slots
   |   book Leviticus           : with    17099 slots
   |   book Numbers             : with    23188 slots
   |   book Deuteronomy         : with    20127 slots
   |   book Joshua              : with    14526 slots
   |   book Judges              : with    14085 slots
   |   book 1_Samuel            : with    18929 slots
   |   book 2_Samuel            : with    15612 slots
   |   book 1_Kings             : with    18685 slots
   |   book 2_Kings             : with    17307 slots
   |   book Isaiah              : with    22931 slots
   |   book Jeremiah            : with    29736 slots
   |   book Ezekiel             : with    26182 slots
   |   book Hosea               : with     3146 slots
   |   book Joel                : with     1318 slots
   |   book Amos                : with     2780 slots
   |   book Obadiah             : with      392 slo

# Check out the volumes

The `extract()` function returns basic information about the volumes:

* long name (all books in the volume)
* short name (used to name its directory on disk)
* location of the volume dataset on the filesystem
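
The three items above can be given names to make later cells easier to read. The following is a minimal sketch, not part of Text-Fabric: the `VolumeInfo` type and its field names are hypothetical, and the sample triples are copied from the output shown in this notebook rather than produced by a live `extract()` call.

```python
from collections import namedtuple

# Hypothetical helper type; the field names are illustrative, not a Text-Fabric API.
VolumeInfo = namedtuple("VolumeInfo", ["long_name", "short_name", "location"])

# Sample triples shaped like the result of extract(), taken from this notebook's output.
sample_volumes = (
    (
        "Obadiah-Nahum-Haggai-Habakkuk-Jonah-Micah",
        "Obadiah---Micah",
        "~/github/etcbc/bhsa/_local/tf/2021/Obadiah---Micah",
    ),
    ("Malachi-Joel", "Malachi-Joel", "~/github/etcbc/bhsa/_local/tf/2021/Malachi-Joel"),
    ("Ezra", "Ezra", "~/github/etcbc/bhsa/_local/tf/2021/Ezra"),
)

# Wrap each triple and index the volumes by their short name.
infos = [VolumeInfo(*triple) for triple in sample_volumes]
by_short_name = {info.short_name: info for info in infos}

for info in infos:
    print(f"{info.short_name:<24} {info.location}")
```

With real data, `sample_volumes` would be replaced by the value returned from `extract()`.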

In [14]:
for v in volumes:
    print(f"volume {v[1]:<24} at {unexpanduser(v[2])}")

volume Obadiah---Micah          at ~/github/etcbc/bhsa/_local/tf/2021/Obadiah---Micah
volume Malachi-Joel             at ~/github/etcbc/bhsa/_local/tf/2021/Malachi-Joel
volume Ezra                     at ~/github/etcbc/bhsa/_local/tf/2021/Ezra


# Load all volumes

We use the result of the `extract()` function to find and load all volumes.

We now get one TF-api handle per volume.

## totalMap

Note that each volume has an extra feature: `totalMap`. The value for each node in the volume dataset
is the corresponding node in the complete dataset from which the volume is taken.

If you use the volume dataset to compute annotations, and you want to publish these annotations against the complete dataset, the feature `totalMap` provides the necessary information to do so.

Suppose `annotvx` is a dict mapping some nodes in the dataset of volume `x` to interesting values. Then you can apply them to the big dataset as follows:

``` python

{F.totalMap.v(n): value for (n, value) in annotvx.items()}
```
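
To see the re-keying in isolation, here is a small sketch that runs without a loaded dataset. The node numbers, the annotation values, and the helper `to_total` are all made up for illustration; a plain dict stands in for the `totalMap` feature.

```python
# Hypothetical stand-in for F.totalMap.v: volume node -> node in the complete dataset.
total_map = {1: 390001, 2: 390002, 3: 390003}

# Hypothetical annotations computed on the volume dataset, keyed by volume node.
annotvx = {1: "noun", 3: "verb"}


def to_total(annotations, map_node):
    """Re-key annotations from volume nodes to nodes of the complete dataset."""
    return {map_node(n): value for (n, value) in annotations.items()}


# dict.get plays the role that F.totalMap.v would play on a real volume.
annot_total = to_total(annotvx, total_map.get)
print(annot_total)
```

On a real volume one would pass `F.totalMap.v` as the second argument instead of `total_map.get`.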

In [15]:
TFs = {}
apis = {}
TF.indent(reset=True)
TF.info("Loading all volumes")
for (longName, name, loc) in volumes:
    TF.info(longName)
    TFs[longName] = Fabric(locations=loc, silent=True)
    apis[longName] = TFs[longName].loadAll(silent="deep")
TF.info("Done")

  0.00s Loading all volumes
  0.00s Obadiah-Nahum-Haggai-Habakkuk-Jonah-Micah
  1.52s Malachi-Joel
  2.19s Ezra
  3.41s Done


# Combine volumes

We can combine volumes by means of the `combine()` function.

In [16]:
vNames = tuple(v[1] for v in volumes)
vNames

('Obadiah---Micah', 'Malachi-Joel', 'Ezra')

In [19]:
combine(
    tuple((v, f"{TARGET}/{v}") for v in vNames),
    f"{TARGET}/bible",
    overwrite=True,
    silent=False,
)

  0.00s inspect metadata ...
  0.00s Loading volume Ezra from ~/github/etcbc/bhsa/_local/tf/2021/Ezra ...
This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

115 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  0.07s All features loaded/computed - for details use TF.loadLog()
   |     0.00s Feature overview: 110 for nodes; 4 for edges; 1 configs; 8 computed
  0.00s loading features ...
  0.08s All additional features loaded - for details use TF.loadLog()
  0.71s Loading volume Malachi-Joel from ~/github/etcbc/bhsa/_local/tf/2021/Malachi-Joel ...
This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

115 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  0.05s All features loaded/c

True

Let's see what we have got:

In [20]:
TF = Fabric(locations=f"{TARGET}/bible")
api = TF.loadAll(silent=False)
api.makeAvailableIn(globals())

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

116 features found and 0 ignored
  0.00s loading features ...
   |     0.02s T otype                from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.20s T oslots               from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.03s T g_cons_utf8          from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book@am              from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book@ru              from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.03s T lex_utf8             from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.03s T g_lex_utf8           from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T qere_trailer_utf8    from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book@bn              from ~/github/etcbc/bhsa/_loc

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

Which books have we got?

In [21]:
for b in F.otype.s("book"):
    print(T.sectionFromNode(b)[0])

Ezra
Joel
Malachi
Obadiah
Jonah
Micah
Nahum
Habakkuk
Haggai


In [22]:
b = T.nodeFromSection(("Obadiah",))
T.text(b, fmt="text-trans-plain")

'XZWN <BDJH KH&>MR >DNJ JHWH L>DWM CMW<H CM<NW M>T JHWH WYJR BGWJM CLX QWMW WNQWMH <LJH LMLXMH00 HNH QVN NTTJK BGWJM BZWJ >TH M>D00 ZDWN LBK HCJ>K CKNJ BXGWJ&SL< MRWM CBTW >MR BLBW MJ JWRDNJ >RY00 >M&TGBJH KNCR W>M&BJN KWKBJM FJM QNK MCM >WRJDK N>M&JHWH00 >M&GNBJM B>W&LK >M&CWDDJ LJLH >JK NDMJTH HLW> JGNBW DJM >M&BYRJM B>W LK HLW> JC>JRW <LLWT00 >JK NXPFW <FW NB<W MYPNJW00 <D&HGBWL CLXWK KL >NCJ BRJTK HCJ>WK JKLW LK >NCJ CLMK LXMK JFJMW MZWR TXTJK >JN TBWNH BW00 HLW> BJWM HHW> N>M JHWH WH>BDTJ XKMJM M>DWM WTBWNH MHR <FW00 WXTW GBWRJK TJMN LM<N JKRT&>JC MHR <FW MQVL00 MXMS >XJK J<QB TKSK BWCH WNKRT L<WLM00 BJWM <MDK MNGD BJWM CBWT ZRJM XJLW WNKRJM B>W C<RW W<L&JRWCLM JDW GWRL GM&>TH K>XD MHM00 W>L&TR> BJWM&>XJK BJWM NKRW W>L&TFMX LBNJ&JHWDH BJWM >BDM W>L&TGDL PJK BJWM YRH00 >L&TBW> BC<R&<MJ BJWM >JDM >L&TR> GM&>TH BR<TW BJWM >JDW W>L&TCLXNH BXJLW BJWM >JDW00 W>L&T<MD <L&HPRQ LHKRJT >T&PLJVJW W>L&TSGR FRJDJW BJWM YRH00 KJ&QRWB JWM&JHWH <L&KL&HGWJM K>CR <FJT J<FH LK GMLK JCWB BR>CK00 KJ K