In [1]:
%load_ext autoreload
%autoreload 2

# Extract

We extract volumes (selections of books) from a large dataset, the BHSA.

In [3]:
import os
from tf.app import use
from tf.fabric import Fabric
from tf.volumes import extract, collect
from tf.core.helpers import unexpanduser

In [4]:
GH = os.path.expanduser("~/github")
BH = f"{GH}/etcbc/bhsa"
VERSION = "2021"
SOURCE = f"{BH}/tf/{VERSION}"
TARGET = f"{BH}/_local/tf/{VERSION}"

In [5]:
VOLUMES = dict(
    tiny=("Obadiah", "Nahum", "Haggai", "Habakkuk", "Jonah", "Micah"),
    small=("Malachi", "Joel"),
    medium=("Ezra",),
)
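
Each volume in the specification above is a tuple of book names. Before running an extraction it can be worth checking that no book is listed twice. The following is a minimal sketch in plain Python, assuming you want each book to end up in at most one volume; `duplicate_books` is a hypothetical helper, not part of Text-Fabric.

```python
from collections import Counter

# Same volume specification as in the cell above.
VOLUMES = dict(
    tiny=("Obadiah", "Nahum", "Haggai", "Habakkuk", "Jonah", "Micah"),
    small=("Malachi", "Joel"),
    medium=("Ezra",),
)


def duplicate_books(volumes):
    """Return the books listed more than once across all volumes (hypothetical helper)."""
    counts = Counter(book for books in volumes.values() for book in books)
    return sorted(book for (book, n) in counts.items() if n > 1)


# An empty list means that every book occurs in at most one volume.
print(duplicate_books(VOLUMES))
```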

# Loading

We load the dataset, and pass its api to the `extract()` function.

If something goes wrong during the extraction, we can inspect the dataset without reloading it.

In a normal scenario, we can just leave out this step. The `extract()` function will
automatically load the dataset if no `api` argument is passed.

In [6]:
TF = Fabric(locations=SOURCE)
api = TF.loadAll()
api.makeAvailableIn(globals())

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

114 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.58s All features loaded/computed - for details use TF.loadLog()
   |     0.00s Feature overview: 109 for nodes; 4 for edges; 1 configs; 8 computed
  0.00s loading features ...
  6.28s All additional features loaded - for details use TF.loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

In [7]:
volumes = extract(SOURCE, TARGET, volumes=VOLUMES, api=api, overwrite=True)

  0.00s Check volumes ...
   |   Volume tiny exists and will be recreated
   |   Volume small exists and will be recreated
   |   Volume medium exists and will be recreated
   |   Work consists of 39 books:
   |   book Genesis             : with    28763 slots
   |   book Exodus              : with    23748 slots
   |   book Leviticus           : with    17099 slots
   |   book Numbers             : with    23188 slots
   |   book Deuteronomy         : with    20127 slots
   |   book Joshua              : with    14526 slots
   |   book Judges              : with    14085 slots
   |   book 1_Samuel            : with    18929 slots
   |   book 2_Samuel            : with    15612 slots
   |   book 1_Kings             : with    18685 slots
   |   book 2_Kings             : with    17307 slots
   |   book Isaiah              : with    22931 slots
   |   book Jeremiah            : with    29736 slots
   |   book Ezekiel             : with    26182 slots
   |   book Hosea               : wit

# Check out the volumes

The `extract()` function returns basic information about the volumes:

* long name (all books in the volume)
* short name (used to name its directory on disk)
* location of the volume dataset on the filesystem

In [8]:
if volumes:
    for (name, loc) in volumes.items():
        print(f"volume {name:<24}: at {loc}")
else:
    print(volumes)

volume tiny                    : at /Users/dirk/github/etcbc/bhsa/_local/tf/2021/tiny
volume small                   : at /Users/dirk/github/etcbc/bhsa/_local/tf/2021/small
volume medium                  : at /Users/dirk/github/etcbc/bhsa/_local/tf/2021/medium
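
Since each location is a directory with feature files, a quick way to confirm that an extraction produced something is to list the `.tf` files in it. This is a sketch in plain Python under that assumption; `feature_files` is a hypothetical helper and not a Text-Fabric function.

```python
from pathlib import Path


def feature_files(loc):
    """List the names of the .tf files directly inside directory loc (hypothetical helper)."""
    path = Path(loc).expanduser()
    if not path.is_dir():
        # A missing directory simply yields no feature files.
        return []
    return sorted(p.name for p in path.glob("*.tf"))


# Usage with the `volumes` dict from above
# (commented out, because it needs the extracted data on disk):
# for (name, loc) in volumes.items():
#     print(f"volume {name:<24}: {len(feature_files(loc))} feature files")
```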


# Load all volumes

We use the result of the `extract()` function to find and load all volumes.

We now get one TF-api handle per volume.

## owork

Note that each volume has an extra feature: `owork`. Its value for each node in a volume dataset
is the corresponding node in the *original work* from which the volume is taken.

If you use the volume to compute annotations,
and you want to publish these annotations against the original work dataset,
the feature `owork` provides the necessary information to do so.

Suppose `annotvx` is a dict mapping some nodes in the dataset of volume `x` to interesting values,
then you apply them to the original work as follows:

``` python
{F.owork.v(n): value for (n, value) in annotvx.items()}
```
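
The re-keying can be tried out without loading any data, by letting a plain dict stand in for `F.owork.v`. The sketch below does that; the node numbers and values are invented for illustration, and `to_work` is a hypothetical helper. It additionally skips nodes for which no `owork` value is found, which is an assumption about what you would want in that case.

```python
# Stand-in for F.owork.v in a volume dataset: volume node -> node in the original work.
# The node numbers are invented for illustration only.
owork = {1: 1001, 2: 1002, 3: 1003}

# Annotations computed against the volume: volume node -> value (invented values).
annotvx = {1: "a", 3: "b"}


def to_work(annotations, owork_value):
    """Re-key annotations from volume nodes to original-work nodes (hypothetical helper).

    owork_value is a function like F.owork.v; nodes without a value are skipped.
    """
    result = {}
    for (n, value) in annotations.items():
        w = owork_value(n)
        if w is not None:
            result[w] = value
    return result


annotw = to_work(annotvx, owork.get)
print(annotw)
```

With a real volume loaded, you would pass `F.owork.v` instead of `owork.get`.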

In [9]:
TFs = {}
apis = {}
TF.indent(reset=True)
TF.info("Loading all volumes")
for (name, loc) in volumes.items():
    TF.info(f"Loading volume {name} ...")
    TFs[name] = Fabric(locations=loc, silent=True)
    apis[name] = TFs[name].loadAll(silent="deep")
TF.info("Done")

  0.00s Loading all volumes
  0.00s Loading volume tiny ...
  1.59s Loading volume small ...
  2.32s Loading volume medium ...
  4.15s Done


# Collect volumes

We can collect volumes into new works by means of the `collect()` function.

Let's collect all volumes.

In [10]:
collect(
    volumes,
    f"{TARGET}/bible",
    overwrite=True,
    silent=False,
)

  0.00s Loading volume tiny                                                         from ~/github/etcbc/bhsa/_local/tf/2021/tiny ...
This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

117 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  0.08s All features loaded/computed - for details use TF.loadLog()
   |     0.00s Feature overview: 112 for nodes; 4 for edges; 1 configs; 8 computed
  0.00s loading features ...
  0.10s All additional features loaded - for details use TF.loadLog()
  0.20s Loading volume small                                                        from ~/github/etcbc/bhsa/_local/tf/2021/small ...
This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

117 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in ote

True

Let's see what we have got:

In [11]:
TF = Fabric(locations=f"{TARGET}/bible")
api = TF.loadAll(silent=False)
api.makeAvailableIn(globals())

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

118 features found and 0 ignored
  0.00s loading features ...
   |     0.02s T otype                from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.19s T oslots               from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T book@zh              from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book@de              from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.03s T lex                  from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T verse                from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book                 from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T book@el              from ~/github/etcbc/bhsa/_local/tf/2021/bible
   |     0.00s T qere_trailer         from ~/github/etcbc/bhsa/_loc

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

Which books have we got?

In [12]:
for b in F.otype.s("book"):
    print(T.sectionFromNode(b)[0])

Obadiah
Jonah
Micah
Nahum
Habakkuk
Haggai
Joel
Malachi
Ezra


In [13]:
b = T.nodeFromSection(("Obadiah",))
T.text(b, fmt="text-trans-plain")

'XZWN <BDJH KH&>MR >DNJ JHWH L>DWM CMW<H CM<NW M>T JHWH WYJR BGWJM CLX QWMW WNQWMH <LJH LMLXMH00 HNH QVN NTTJK BGWJM BZWJ >TH M>D00 ZDWN LBK HCJ>K CKNJ BXGWJ&SL< MRWM CBTW >MR BLBW MJ JWRDNJ >RY00 >M&TGBJH KNCR W>M&BJN KWKBJM FJM QNK MCM >WRJDK N>M&JHWH00 >M&GNBJM B>W&LK >M&CWDDJ LJLH >JK NDMJTH HLW> JGNBW DJM >M&BYRJM B>W LK HLW> JC>JRW <LLWT00 >JK NXPFW <FW NB<W MYPNJW00 <D&HGBWL CLXWK KL >NCJ BRJTK HCJ>WK JKLW LK >NCJ CLMK LXMK JFJMW MZWR TXTJK >JN TBWNH BW00 HLW> BJWM HHW> N>M JHWH WH>BDTJ XKMJM M>DWM WTBWNH MHR <FW00 WXTW GBWRJK TJMN LM<N JKRT&>JC MHR <FW MQVL00 MXMS >XJK J<QB TKSK BWCH WNKRT L<WLM00 BJWM <MDK MNGD BJWM CBWT ZRJM XJLW WNKRJM B>W C<RW W<L&JRWCLM JDW GWRL GM&>TH K>XD MHM00 W>L&TR> BJWM&>XJK BJWM NKRW W>L&TFMX LBNJ&JHWDH BJWM >BDM W>L&TGDL PJK BJWM YRH00 >L&TBW> BC<R&<MJ BJWM >JDM >L&TR> GM&>TH BR<TW BJWM >JDW W>L&TCLXNH BXJLW BJWM >JDW00 W>L&T<MD <L&HPRQ LHKRJT >T&PLJVJW W>L&TSGR FRJDJW BJWM YRH00 KJ&QRWB JWM&JHWH <L&KL&HGWJM K>CR <FJT J<FH LK GMLK JCWB BR>CK00 KJ K

# With the advanced API `A`

So far we have worked with datasets that are essentially one directory with feature files.

But if we load the BHSA with the advanced API, like `A = use("bhsa", ...)`, we also get some standard modules, each with its own features.

Now, if we want to split the BHSA into volumes, we also want to include the features of these modules in the volumes.

That is entirely possible, and can be done in a convenient way.

Let's first point at some interesting features and see whether they are loaded right now.

In [14]:
api.isLoaded("lex")

{'kind': 'node', 'type': 'str', 'edgeValues': None}

In [15]:
api.isLoaded("phono")

{}

In [16]:
api.isLoaded("crossref")

{}

So, `lex` is loaded, `phono` and `crossref` are not.

Now we load the BHSA in the advanced way:

In [27]:
A = use("bhsa", hoist=globals(), silent=False)

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

120 features found and 0 ignored


We check that the features of interest are loaded:

In [28]:
A.isLoaded("lex")

{'kind': 'node', 'type': 'str', 'edgeValues': None}

In [29]:
A.isLoaded("phono")

{'kind': 'node', 'type': 'str', 'edgeValues': None}

In [30]:
A.isLoaded("crossref")

{'kind': 'edge', 'type': 'int', 'edgeValues': True}

We can use `A` to split the loaded work into the same volumes as before.

In [31]:
A.extract(VOLUMES)

  0.00s Check volumes ...
   |   Volume tiny already exists and will not be recreated
   |   Volume small already exists and will not be recreated
   |   Volume medium already exists and will not be recreated
   |   Work consists of 39 books:
   |   book Genesis             : with    28763 slots
   |   book Exodus              : with    23748 slots
   |   book Leviticus           : with    17099 slots
   |   book Numbers             : with    23188 slots
   |   book Deuteronomy         : with    20127 slots
   |   book Joshua              : with    14526 slots
   |   book Judges              : with    14085 slots
   |   book 1_Samuel            : with    18929 slots
   |   book 2_Samuel            : with    15612 slots
   |   book 1_Kings             : with    18685 slots
   |   book 2_Kings             : with    17307 slots
   |   book Isaiah              : with    22931 slots
   |   book Jeremiah            : with    29736 slots
   |   book Ezekiel             : with    26182 slots
 

{}

Now we load a single volume:

In [55]:
A = use("bhsa", volume="tiny", hoist=globals(), silent=False)

This is Text-Fabric 8.5.14
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

123 features found and 0 ignored


In [51]:
TF.volumeInfo

'tiny:Obadiah-Nahum-Haggai-Habakkuk-Jonah-Micah'

In [52]:
A.volumeInfo

'tiny:Obadiah-Nahum-Haggai-Habakkuk-Jonah-Micah'
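
The volume info shown above appears to consist of the volume name, a colon, and the book names joined by hyphens. Assuming that form, and assuming that book names contain neither `-` nor `:`, it can be taken apart as follows. This is a sketch, not a documented format; `parse_volume_info` is a hypothetical helper.

```python
def parse_volume_info(info):
    """Split a volume info string into (name, books) (hypothetical helper).

    Assumes the form 'name:book1-book2-...' seen in the output above,
    and that book names themselves contain no '-' or ':'.
    """
    (name, books) = info.split(":", 1)
    return (name, tuple(books.split("-")))


(vol_name, vol_books) = parse_volume_info(
    "tiny:Obadiah-Nahum-Haggai-Habakkuk-Jonah-Micah"
)
print(vol_name, vol_books)
```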