![dans](images/dans.png)
![tf](images/tf-small.png)

# Turn your corpus into a Text-Fabric dataset

## Corpus

We start with a baby corpus:

1 book, with 2 chapters, each having one or two sentences.

Here is the complete corpus source material:

In [1]:
source = '''
# Consider Phlebas
$ author=Iain M. Banks

## 1
Everything about us,
everything around us,
everything we know [and can know of] is composed ultimately of patterns of nothing;
that’s the bottom line, the final truth.

So where we find we have any control over those patterns,
why not make the most elegant ones, the most enjoyable and good ones,
in our own terms?

## 2
Besides,
it left the humans in the Culture free to take care of the things that really mattered in life,
such as [sports, games, romance,] studying dead languages,
barbarian societies and impossible problems,
and climbing high mountains without the aid of a safety harness.
'''

Note a few details:

* `#` marks the title of a book, section level 1.
* `##` marks the number of a chapter, section level 2.
* `$` starts a line with key=value metadata: the author.
* *blank lines* split sentences.
  There are 2 sentences in the first chapter and 1 in the second one.
* We will give each sentence a number within its section.
* The sentences are divided into lines.
* We will give each line a number within its sentence.
* Words within [ ] will not be part of the line: the line has a gap there.
* The gapped words will have a feature `gap=1`.
* Lines will be split into words, the slot nodes.
* We separate the word from its punctuation, which gets added in a `punc` feature.
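
The last few conventions can be tried out on a single line before any Text-Fabric machinery is involved. The following is a minimal sketch in plain Python: the helper name `split_line` is made up for illustration, and the regular expression is the same one the director uses further down. It returns, for each word, its letters, its trailing punctuation, and whether it sits inside a `[ ]` gap.

```python
import re

# Same pattern as in the director below: the letters come first,
# any trailing non-alphanumeric characters count as punctuation.
WORD_RE = re.compile(r'^(.*?)([^A-Za-z0-9]*)$')


def split_line(line):
    """Return (letters, punc, gap) triples for one source line (hypothetical helper)."""
    result = []
    in_gap = False
    for token in line.split():
        if token.startswith('['):
            in_gap = True
            token = token[1:]
        closes = token.endswith(']')
        if closes:
            token = token[:-1]
        letters, punc = WORD_RE.match(token).groups()
        result.append((letters, punc, 1 if in_gap else None))
        if closes:
            in_gap = False  # the gap ends after the word carrying the bracket
    return result


words = split_line('such as [sports, games, romance,] studying dead languages,')
for word in words:
    print(word)
```

Words outside the brackets get `None` for the gap, mirroring the fact that the `gap` feature is only set on gapped words.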


## Fire up

Now we start the engines: Text-Fabric, and the *walker* conversion module.

In [2]:
import os
import re

from tf.fabric import Fabric
from tf.convert.walker import CV

We call up TF and let it look into the directory where the output has to land,
in this case your download directory.
It will not like what it sees, but that is not our concern.

In [3]:
TF_DIR = os.path.expanduser('~/Downloads/banks/tf')

TF = Fabric(locations=TF_DIR)

This is Text-Fabric 7.4.3
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored


Next we initialize the conversion machinery: we obtain an object with methods.

In [4]:
cv = CV(TF)

## TF configuration

A Text-Fabric dataset is a bunch of individual `.tf` files that start with a little bit of metadata and then contain 
a stream of data, typically the values of a single feature for each node or edge in the graph.

We specify the metadata bit by bit.

### slot type

A crucial design aspect of each TF dataset is its granularity. What are the slots?

Words, morphemes, characters?

You decide.

In [5]:
slotType = 'word'

### Provenance

Users who encounter your TF data in the wild will be thankful if you took the
trouble to say, in a few key-value pairs, what it is about.

The metadata you specify here will end up in all generated tf features.

In [6]:
generic = {
    'name': 'Culture quotes from Iain Banks',
    'compiler': 'Dirk Roorda',
    'source': 'Good Reads',
    'url': 'https://www.goodreads.com/work/quotes/14366-consider-phlebas',
}
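
Judging from the generated files shown in the Inspect section, the pairs you supply end up as `@key=value` header lines in alphabetical key order. Here is a minimal sketch of that rendering in plain Python. It is not Text-Fabric's own code, and the helper name `metadata_header` is made up for illustration.

```python
def metadata_header(meta):
    """Render key-value pairs as `@key=value` lines, sorted by key (hypothetical helper)."""
    return '\n'.join(f'@{key}={value}' for key, value in sorted(meta.items()))


example = {
    'name': 'Culture quotes from Iain Banks',
    'compiler': 'Dirk Roorda',
    'source': 'Good Reads',
}
header = metadata_header(example)
print(header)
```

In the real files Text-Fabric adds a few keys of its own, such as `valueType`, `writtenBy` and `dateWritten`.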

### Text matters

A few things concerning the presentation of your text can be specified in the `otext` feature.
This is a TF feature without data; it has only metadata.

It contains the specs for the section structure of your corpus and the text formats.

#### Section structure

Currently, TF assumes that there are two or three section levels.
For each level you have to specify the corresponding node type and the feature that contains the section name or number.

#### Text formats

When you ask TF to render slot nodes (the nodes with text), TF needs to know
which features to render. 

A text format is a template with placeholders for the features you want to use.

In [7]:
otext = {
    'fmt:text-orig-full': '{letters}{punc} ',
    'sectionTypes': 'book,chapter',
    'sectionFeatures': 'title,number',
}
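
To get a feel for what the `fmt:text-orig-full` template does, here is a rough imitation with Python's own `str.format`. It only mimics the idea of filling placeholders with feature values. It is not how Text-Fabric renders text internally, and the word list is typed in by hand.

```python
fmt = '{letters}{punc} '

# Feature values of the first three words of the corpus, entered by hand
words = [
    {'letters': 'Everything', 'punc': ''},
    {'letters': 'about', 'punc': ''},
    {'letters': 'us', 'punc': ','},
]

# Fill the template once per word and glue the results together
text = ''.join(fmt.format(**word) for word in words)
print(repr(text))
```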

### Typing

The values of features are usually strings.
But if you know that they are always integers (or None), you can declare a feature as an integer valued feature.

The only thing you have to do is to declare them in the following set:

In [8]:
intFeatures = {
  'number',
  'gap'
}

### Descriptions

You can say per feature what it does.
Use as many key-values as you like.

A good *description* is particularly helpful.

It is surprising how often you want to consult those descriptions yourself.

In [9]:
featureMeta = {
    'number': {
        'description': 'number of chapter, or sentence in chapter, or line in sentence',
    },
    'gap': {
        'description': '1 for words that occur between [ ]',
    },
    'title': {
        'description': 'the title of a book',
    },
    'author': {
        'description': 'the author of a book',
    },
    'terminator': {
        'description': 'the last character of a line',
    },
    'letters': {
        'description': 'the letters of a word',
    },
    'punc': {
        'description': 'the punctuation after a word',
    },
}

## Director

This is the heart of the matter.

You program the director so that it unwraps your source material.

Every time the director encounters a new textual object in the source,
it issues an action requesting a new node.
The director gets a receipt, by which it can issue subsequent
actions for that node, like adding feature values to it.

And when the object is done, the director issues a `terminate` action.

During all this, the `cv` machine is busy translating these actions into
the construction of a proper TF graph representing all the
source material that you have exposed to it.

A few things to note:

* If you want to terminate a node, you do not have to worry whether the node exists or has already
  been terminated: just do it.
* If you have terminated a node, you can still resume it.
* When you add nodes, slots and non-slots, there is magic behind the scenes:
  * when a slot node is added, it will be linked to all active non-slot nodes,
    i.e. the ones that have not been terminated or have been resumed;
  * when a non-slot node is added, it becomes automatically active,
    in the sense that it will be linked to subsequent slot nodes, before it is terminated,
    or after it has been resumed.
* If a fatal error is encountered, the director can simply say `cv.stop(message)`.
* After the director is done, TF will perform several checks on the result.
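
The linking rules above can be modelled in a few lines of plain Python. The class below is a toy stand-in written for this explanation only: it is not the real `cv` object, its name `ToyWalker` is made up, and it ignores features, edges and all checks. It just keeps track of which non-slot nodes are active and links every new slot to them.

```python
class ToyWalker:
    """Hypothetical model of the active-node bookkeeping; not the real walker."""

    def __init__(self):
        self.n_slots = 0
        self.n_nodes = 0
        self.links = {}      # non-slot node -> list of linked slots
        self.active = set()  # non-slot nodes that currently collect slots

    def node(self, kind):
        # A new non-slot node is active from the start
        self.n_nodes += 1
        n = (kind, self.n_nodes)
        self.links[n] = []
        self.active.add(n)
        return n

    def slot(self):
        # A new slot is linked to every active non-slot node
        self.n_slots += 1
        for n in self.active:
            self.links[n].append(self.n_slots)
        return self.n_slots

    def terminate(self, n):
        # Harmless if n is None or already terminated
        self.active.discard(n)

    def resume(self, n):
        if n in self.links:
            self.active.add(n)


tw = ToyWalker()
sentence = tw.node('sentence')
line = tw.node('line')
tw.slot()
tw.slot()
tw.terminate(line)   # a gap starts: the line stops collecting slots
tw.slot()            # this slot belongs to the sentence only
tw.resume(line)      # the gap is over
tw.slot()
tw.terminate(line)
tw.terminate(sentence)

print(tw.links[line])      # slots of the line, without the gapped one
print(tw.links[sentence])  # all slots
```

The line ends up with slots 1, 2 and 4, while the sentence has all four. This is the same pattern the director uses for words between `[ ]`.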

In [10]:
def director(cv):
  counter = dict(
    sentence=0,
    line=0,
  )
  cur = dict(
    book=None,
    chapter=None,
    sentence=None,
  )

  wordRe = re.compile(r'^(.*?)([^A-Za-z0-9]*)$')
  metaRe = re.compile(r'^\$\s*([^= ]+)\s*=\s*(.*)')

  for line in source.strip().split('\n'):
    line = line.rstrip()
    if not line:
      cv.terminate(cur['sentence'])              # action
      for ntp in counter:
        counter[ntp] += 1
      cur['sentence'] = cv.node('sentence')      # action
      cv.feature(
        cur['sentence'],
        number=counter['sentence'],
      )                                          # action
      continue
      
    if line.startswith('# '):
      for ntp in ('sentence', 'chapter', 'book'):
        cv.terminate(cur[ntp])                   # action
        cur[ntp] = None         
      title = line[2:].strip()
      cur['book'] = cv.node('book')              # action
      for ntp in counter:
        counter[ntp] = 0
      cv.feature(
        cur['book'],
        title=title,
      )                                          # action
      continue

    if line.startswith('## '):
      for ntp in ('sentence', 'chapter'):
        cv.terminate(cur[ntp])                   # action
        cur[ntp] = None         
      number = line[2:].strip()
      cur['chapter'] = cv.node('chapter')        # action
      for ntp in counter:
        counter[ntp] = 0
      cv.feature(
        cur['chapter'],
        number=number,
      )                                          # action
      continue

    if line.startswith('$'):
      match = metaRe.match(line)
      if not match:
        cv.stop(f'Malformed metadata line: "{line}"') # action
        return
      name = match.group(1)
      value = match.group(2)
      cv.feature(
        cur['book'],
        **{name: value},
      )                                           # action
      continue
        
    if not cur['sentence']:
      cur['sentence'] = cv.node('sentence')       # action
      counter['sentence'] += 1
      cv.feature(
        cur['sentence'],
        number=counter['sentence'],
      )                                           # action
      
    cur['line'] = cv.node('line')                 # action
    counter['line'] += 1
    cv.feature(
      cur['line'],
      terminator=line[-1],
      number=counter['line'],
    )                                              # action
    
    gap = False
    for word in line.split():
      if word.startswith('['):
        gap = True
        cv.terminate(cur['line'])   # action
        w = cv.slot()               # action
        cv.feature(w, gap=1)        # action
        word = word[1:]
      elif word.endswith(']'):
        w = cv.slot()               # action
        cv.resume(cur['line'])      # action
        cv.feature(w, gap=1)        # action
        gap = False
        word = word[0:-1]
      else:
        w = cv.slot()
        if gap:
          cv.feature(w, gap=1)      # action

      (letters, punc) = wordRe.findall(word)[0]
      cv.feature(w, letters=letters)            # action
      if punc:
        cv.feature(w, punc=punc)                # action
    cv.terminate(cur['line'])                   # action
    cur['line'] = None
    
  for ntp in ('sentence', 'chapter', 'book'):
    cv.terminate(cur[ntp])                      # action
    

## Run

We are going to run the conversion and check whether all is well.

In [11]:
good = cv.walk(
    director,
    slotType,
    otext=otext,
    generic=generic,
    intFeatures=intFeatures,
    featureMeta=featureMeta,
)

good

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |   SECTION TYPES:    book, chapter
   |   SECTION FEATURES: title, number
   |   TEXT    FEATURES:
   |      |   text-orig-full       letters, punc
   |     0.00s OK
   |     0.00s Following director... 
   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 144
   |     0.00s "node" actions: 20
   |     0.00s "resume" actions: 2
   |     0.00s "slot" actions: 99
   |     0.00s "terminate" actions: 27
   |          1 x "book" node 
   |          2 x "chapter" node 
   |         12 x "line" node 
   |          5 x "sentence" node 
   |         99 x "word" node  = slot type
   |        119 nodes of all types
   |     0.01s OK
   |     0.00s Removing unlinked nodes ... 
   |      |    -0.00s      2 unlinked "sentence" nodes: [1, 4]
   |      |     0.00s      2 unlinked nodes
   |      |     0.00s Leaving    117 nodes
   |     0.00s checking for nodes and edges ... 
   |     0.0

True

## Inspect

Let's inspect some of the files:

### otype

In [12]:
with open(f'{TF_DIR}/otype.tf') as fh:
  print(fh.read())

@node
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T15:15:20Z

1-99	word
100	book
101-102	chapter
103-114	line
115-117	sentence



### otext

In [13]:
with open(f'{TF_DIR}/otext.tf') as fh:
  print(fh.read())

@config
@compiler=Dirk Roorda
@fmt:text-orig-full={letters}{punc} 
@name=Culture quotes from Iain Banks
@sectionFeatures=title,number
@sectionTypes=book,chapter
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T15:15:20Z




### oslots

In [14]:
with open(f'{TF_DIR}/oslots.tf') as fh:
  print(fh.read())

@edge
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T15:15:20Z

100	1-99
1-55
56-99
1-3
4-6
7-9,14-20
21-27
28-38
39-51
52-55
56
57-75
76-77,81-83
84-88
89-99
1-27
28-55
56-99



#### Explanation

* line `100	1-99` tells us that node 100 (the book node, see *otype*) is linked to slots 1-99, which are all the slots.
* the next line has only `1-55`. These are the slots of node 101: when a line names no node, it is about the previous node plus 1.

And so on.
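
A slot specification such as `7-9,14-20` is a comma-separated list of single slots and ranges. As a minimal sketch, assuming only the shapes visible in the output above (the helper name `expand_slots` is made up), it can be expanded like this:

```python
def expand_slots(spec):
    """Expand a spec like '7-9,14-20' into a list of slot numbers (hypothetical helper)."""
    slots = []
    for part in spec.split(','):
        first, _, last = part.partition('-')
        # A part without '-' is a single slot, so last falls back to first
        slots.extend(range(int(first), int(last or first) + 1))
    return slots


print(expand_slots('7-9,14-20'))
print(expand_slots('56'))
```

The first spec belongs to node 105, the third line of the first sentence. Slots 10-13 are missing from it: those are the gapped words `and can know of`.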

### number

In [15]:
with open(f'{TF_DIR}/number.tf') as fh:
  print(fh.read())

@node
@compiler=Dirk Roorda
@description=number of chapter, or sentence in chapter, or line in sentence
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=int
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T15:15:20Z

101	1
2
1
2
3
4
6
7
8
1
2
3
4
5
1
2
1



#### Explanation

Node 101 is the first chapter node, which has chapter number 1.

The next line is about node 102, the second chapter, with number 2.

The following line refers to node 103, and a quick glance at the *otype* feature shows that this is a line.

The last three lines are about the three sentences, which are numbered within their chapter:
`1` then `2` and then again `1`.
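
The implicit node numbering can be made explicit with a small amount of plain Python. This is a sketch under assumptions, not Text-Fabric's own reader: the helper name `expand_node_feature` is made up, and it only handles the two line shapes visible in the outputs above (a node or node range, a tab, and a value; or a value on its own).

```python
def expand_node_feature(data_lines):
    """Map each node to its value, filling in the implicit node numbers (hypothetical helper)."""
    values = {}
    node = 0  # a first line without a node is taken to be about node 1
    for line in data_lines:
        fields = line.split('\t')
        if len(fields) == 2:
            # Explicit node or node range, e.g. '101' or '1-99'
            spec, value = fields
            first, _, last = spec.partition('-')
            nodes = range(int(first), int(last or first) + 1)
        else:
            # Value only: it belongs to the previous node plus 1
            value = fields[0]
            nodes = range(node + 1, node + 2)
        for node in nodes:
            values[node] = value
    return values


numbers = expand_node_feature(['101\t1', '2', '1', '2'])
print(numbers)

otypes = expand_node_feature(['1-99\tword', '100\tbook', '101-102\tchapter'])
print(otypes[50], otypes[100], otypes[102])
```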

### letters

In [16]:
with open(f'{TF_DIR}/letters.tf') as fh:
  print(fh.read())

@node
@compiler=Dirk Roorda
@description=the letters of a word
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T15:15:20Z

Everything
about
us
everything
around
us
everything
we
know
and
can
know
of
is
composed
ultimately
of
patterns
of
nothing
that’s
the
bottom
line
the
final
truth
So
where
we
find
we
have
any
control
over
those
patterns
why
not
make
the
most
elegant
ones
the
most
enjoyable
and
good
ones
in
our
own
terms
Besides
it
left
the
humans
in
the
Culture
free
to
take
care
of
the
things
that
really
mattered
in
life
such
as
sports
games
romance
studying
dead
languages
barbarian
societies
and
impossible
problems
and
climbing
high
mountains
without
the
aid
of
a
safety
harness



#### Explanation

The plain, clean text of everything.

## Load TF

We are going to load the new data: all features.

We start a new instance of the TF machinery.

In [17]:
TF = Fabric(locations=TF_DIR)

This is Text-Fabric 7.4.3
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored


We ask for a list of all features:

In [18]:
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
loadableFeatures

('author',
 'gap',
 'letters',
 'number',
 'otype',
 'punc',
 'terminator',
 'title',
 'oslots')

We load all features:

In [19]:
api = TF.load(loadableFeatures, silent=False)

  0.00s loading features ...
   |     0.00s T otype                from /Users/dirk/Downloads/banks/tf
   |     0.00s T oslots               from /Users/dirk/Downloads/banks/tf
   |     0.00s T title                from /Users/dirk/Downloads/banks/tf
   |     0.00s T number               from /Users/dirk/Downloads/banks/tf
   |     0.00s T letters              from /Users/dirk/Downloads/banks/tf
   |     0.00s T punc                 from /Users/dirk/Downloads/banks/tf
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.00s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.00s C __levUp__            from otype, oslots, __levels__, __rank__
   |      |     0.00s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __boundary__         from otype, oslots, __rank__
   |      |     0.00s C __sections__         from otype, oslots, otext, __le

You see that all files are marked with a `T`.

That means that Text-Fabric loads the features by reading the plain text `.tf` files.
But after reading, it makes a binary equivalent and stores it as a `.tfx`
file in the hidden `.tf` directory next to it.

Furthermore, you see some lines marked with `C`. Here Text-Fabric is computing derived data,
mostly about sections, the order of nodes, and the relative positions of nodes with respect to the slots they
are linked to.

The results of this pre-computation are also stored in that hidden `.tf` directory.

The next time, Text-Fabric loads the data from their binary forms, which is much faster.
And the pre-computation step will be skipped.

If the binary files get outdated, Text-Fabric will recompile and recompute everything automatically.
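
How Text-Fabric decides that a binary file is outdated is not spelled out here. One common way to make such a decision is to compare modification times, and the sketch below shows that general idea only. It is an assumption for illustration, not Text-Fabric's actual logic, and the file names and the helper `is_outdated` are made up.

```python
import os
import tempfile


def is_outdated(source_path, cache_path):
    """True if the cache is missing or older than its source (hypothetical helper)."""
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(cache_path) < os.path.getmtime(source_path)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'letters.tf')
    cache = os.path.join(tmp, 'letters.tfx')

    with open(src, 'w') as fh:
        fh.write('Everything\n')
    missing = is_outdated(src, cache)   # no cache yet

    with open(cache, 'w') as fh:
        fh.write('binary stand-in\n')
    os.utime(src, (1000, 1000))         # set explicit timestamps so the
    os.utime(cache, (2000, 2000))       # comparison is deterministic
    fresh = is_outdated(src, cache)     # cache newer than source

    os.utime(src, (3000, 3000))         # the source has been edited since
    stale = is_outdated(src, cache)

print(missing, fresh, stale)
```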

So let's load again.

In [20]:
TF = Fabric(locations=TF_DIR)
api = TF.load(loadableFeatures, silent=False)

This is Text-Fabric 7.4.3
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

10 features found and 0 ignored
  0.00s loading features ...
   |     0.00s B otype                from /Users/dirk/Downloads/banks/tf
   |     0.00s B oslots               from /Users/dirk/Downloads/banks/tf
   |     0.00s B title                from /Users/dirk/Downloads/banks/tf
   |     0.00s B number               from /Users/dirk/Downloads/banks/tf
   |     0.00s B letters              from /Users/dirk/Downloads/banks/tf
   |     0.00s B punc                 from /Users/dirk/Downloads/banks/tf
   |     0.00s B author               from /Users/dirk/Downloads/banks/tf
   |     0.00s B gap                  from /Users/dirk/Downloads/banks/tf
   |     0.00s B terminator           from /Users/dirk/Downloads/banks/tf
  0.03s All features loaded/computed - for details use loadLog()


Where there were `T`s before, there are now `B`s.

### Hoisting

We can access all TF data programmatically by using `api.Features`, or `api.F` (same thing) and a bunch of
other API members. 

But if we are working with a single data source, we can hoist those API members to the global namespace.

This is not something to do when you write modules for other people, but if you are the user yourself,
why not make life just a little bit easier?

In [21]:
api.makeAvailableIn(globals())

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('ensureLoaded', 'TF', 'ignored', 'loadLog')),
 ('Locality', 'locality', ('L Locality',)),
 ('Misc', 'messaging', ('cache', 'error', 'indent', 'info', 'reset')),
 ('Nodes',
  'navigating-nodes',
  ('N Nodes', 'sortKey', 'sortKeyTuple', 'otypeRank', 'sortNodes')),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

As a result, you have an overview of the names you can use.

## Exploration

Finally, let's explore this set by means of Text-Fabric.

### Frequency list

We can get ordered frequency lists for the values of all features.

First the words:

In [22]:
F.letters.freqList()

(('the', 8),
 ('of', 5),
 ('and', 4),
 ('in', 3),
 ('we', 3),
 ('everything', 2),
 ('know', 2),
 ('most', 2),
 ('ones', 2),
 ('patterns', 2),
 ('us', 2),
 ('Besides', 1),
 ('Culture', 1),
 ('Everything', 1),
 ('So', 1),
 ('a', 1),
 ('about', 1),
 ('aid', 1),
 ('any', 1),
 ('around', 1),
 ('as', 1),
 ('barbarian', 1),
 ('bottom', 1),
 ('can', 1),
 ('care', 1),
 ('climbing', 1),
 ('composed', 1),
 ('control', 1),
 ('dead', 1),
 ('elegant', 1),
 ('enjoyable', 1),
 ('final', 1),
 ('find', 1),
 ('free', 1),
 ('games', 1),
 ('good', 1),
 ('harness', 1),
 ('have', 1),
 ('high', 1),
 ('humans', 1),
 ('impossible', 1),
 ('is', 1),
 ('it', 1),
 ('languages', 1),
 ('left', 1),
 ('life', 1),
 ('line', 1),
 ('make', 1),
 ('mattered', 1),
 ('mountains', 1),
 ('not', 1),
 ('nothing', 1),
 ('our', 1),
 ('over', 1),
 ('own', 1),
 ('problems', 1),
 ('really', 1),
 ('romance', 1),
 ('safety', 1),
 ('societies', 1),
 ('sports', 1),
 ('studying', 1),
 ('such', 1),
 ('take', 1),
 ('terms', 1),
 ('that', 1),

For the node types we can get info by calling this:

In [23]:
C.levels.data

(('book', 99.0, 100, 100),
 ('chapter', 49.5, 101, 102),
 ('sentence', 33.0, 115, 117),
 ('line', 7.666666666666667, 103, 114),
 ('word', 1, 1, 99))

It means that chapters are 49.5 words long on average, and that the chapter nodes are 101 and 102.

And you see that we have 99 words.
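
The averages in this table can be checked by hand against the *oslots* data shown in the Inspect section. The sketch below copies the slot specifications from that output and computes the average number of slots per node for each type. The helper name `count_slots` is made up for illustration.

```python
def count_slots(spec):
    """Count the slots in a spec like '7-9,14-20' (hypothetical helper)."""
    total = 0
    for part in spec.split(','):
        first, _, last = part.partition('-')
        total += int(last or first) - int(first) + 1
    return total


# Slot specifications copied from the oslots output above
oslots = {
    'book': ['1-99'],
    'chapter': ['1-55', '56-99'],
    'line': [
        '1-3', '4-6', '7-9,14-20', '21-27', '28-38', '39-51', '52-55',
        '56', '57-75', '76-77,81-83', '84-88', '89-99',
    ],
    'sentence': ['1-27', '28-55', '56-99'],
}

averages = {
    node_type: sum(count_slots(spec) for spec in specs) / len(specs)
    for node_type, specs in oslots.items()
}
for node_type, average in averages.items():
    print(node_type, average)
```

The lines cover only 92 of the 99 slots, because the seven gapped words belong to no line. That is why the average line length is 92 / 12 rather than 99 / 12.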

## Links

Of course, there is much more to TF.

Have a look through tutorials for several corpora: Hebrew and Syriac Bible, Quran, Uruk Cuneiform.

Navigate from [here](https://nbviewer.jupyter.org/github/annotation/tutorials/tree/master/).

Now that conversion is this easy, more corpora will follow.

The docs for conversion are [here](https://annotation.github.io/text-fabric/Create/Convert/).