<img src="https://annotation.github.io/text-fabric/images/tf-small.png">

# Chapter 23 - Corpus Analysis with Text-Fabric

If you work in certain areas of the humanities, you may find yourself returning
to the same corpus or set of corpora. You may also want to annotate that corpus with
features of interest for your research. In that case, it helps to have a tool that
gives you quick access to your text corpus and lets you attach features
to certain words, phrases, sentences, etc. that can be retrieved later with simple queries.

[Text-Fabric](https://annotation.github.io/text-fabric/) is a tool built specifically with
these use-cases in mind. In this notebook, we will explore the basic functionality of Text-Fabric,
as well as review how to construct a basic Text-Fabric resource.

## How to model a text?

There are lots of ways to store and model a text corpus. XML, for instance, 
is a hierarchical structure that allows for tree-like embedding of linguistic objects. 
It is the basis for standards such as TEI. But XML is not good at representing discontinuous
items, something we encounter frequently in language:

In [3]:
sentence = 'This sentence which I wrote is here.'

# we can build such a tree like this:
tree_hierarchy = """

<sentence>
    <clause>
    This sentence
        <clause>
        which I wrote
        </clause>
    is here.
    </clause>
</sentence>

"""

# but how do we model this??
sentence2 = 'This sentence—yes, this one—is here'

**XML is a form of in-line markup, where the annotations are mixed in with the text**. 
This makes it not only difficult to model discontinuous items, but also difficult to
quickly select only those items of interest, since one has to first iterate over all
of the annotation.

**Text-Fabric is a form of stand-off markup, where the annotations are stored separately
from the text**. This approach makes it easy to model discontinuous items. It also maintains
a separation of concerns between the text and the annotation. 

In [5]:
# stand-off markup in Text-Fabric style

#          word number    1     2      3    4    5   6   7
interrupted_sentence = 'This sentence—yes, this one—is here.'

# word mapping to our new sentence objects
# each number corresponds to a word in the text:

object_to_slots = {
    'sentence1': (1, 2, 6, 7),
    'sentence2': (3, 4, 5),
}

Now we've defined our sentences in terms of the atomic items, or slots, they contain. 
In this case, we have arbitrarily chosen the word as the slot unit, but we could have chosen
letters instead. Notice that dividing the text this way allows the discontinuity to 
be reflected without losing the information that `sentence2` sits between the items of 
`sentence1`.

We can also assign integers to the new sentences themselves. That will allow us to associate
features with those sentences. 

We therefore need a unique ID for each sentence. **Let's arbitrarily begin the sentence ID count at the 
number of slots in the corpus + 1**. In this case, we have 7 words, so ID 8 is the first
sentence, ID 9 is the second, and so on. If we had more than 7 slots, we would start the count
higher.

In [6]:
#          word number    1     2      3    4    5   6   7
interrupted_sentence = 'This sentence—yes, this one—is here.'

# word mapping to our new sentence objects
# each number corresponds to a word in the text:

object_to_slots = {
    8: (1, 2, 6, 7),
    9: (3, 4, 5),
}

# to keep track of which integers belong to which
# kinds of objects, we also create a dictionary that 
# stores each integer's object type

object_types = {
    1: 'word',
    2: 'word',
    3: 'word',
    4: 'word',
    5: 'word',
    6: 'word',
    7: 'word',
    8: 'sentence',
    9: 'sentence',
}
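The numbering convention above can also be automated. The following is a minimal sketch in plain Python, not part of Text-Fabric itself; the helper name `assign_node_ids` and its arguments are hypothetical and only serve to illustrate the idea of handing out IDs starting at the number of slots + 1.

```python
def assign_node_ids(n_slots, objects):
    """Give each (otype, slots) object the next free integer after the slots."""
    object_to_slots = {}
    # every slot is a word in this toy example
    object_types = {slot: 'word' for slot in range(1, n_slots + 1)}
    next_id = n_slots + 1  # first ID available for a non-slot object
    for otype, slots in objects:
        object_to_slots[next_id] = tuple(slots)
        object_types[next_id] = otype
        next_id += 1
    return object_to_slots, object_types


slots_by_node, types_by_node = assign_node_ids(
    7,
    [('sentence', (1, 2, 6, 7)), ('sentence', (3, 4, 5))],
)
print(slots_by_node)
print(types_by_node[8], types_by_node[9])
```

With this approach the IDs never have to be written by hand, so adding more words or sentences cannot produce a collision between a slot number and a sentence number.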

Now that we've divided the text into objects, we can begin to
associate features with those objects. We've already seen an example of 
such a feature dictionary above with `object_types`. We can follow the
same logic to assign other features. 

**The surface text itself can be modeled as a feature**. Done in this way,
we can even choose to arbitrarily assign various kinds of transcriptions or
formats to the text.

In [None]:
plain_text = {
    1: 'This ',
    2: 'sentence—',
    3: 'yes, ',
    4: 'this ',
    5: 'one—',
    6: 'is ',
    7: 'here.'
}

plain_text_no_punctuation = {
    1: 'This',
    2: 'sentence',
    3: 'yes',
    4: 'this',
    5: 'one',
    6: 'is',
    7: 'here'
}
punctuation = {
    1: ' ',
    2: '—',
    3: ', ',
    4: ' ',
    5: '—',
    6: ' ',
    7: '.'
}
greek_transcription = {
    1: 'Θις ',
    2: 'σεντενσε—',
    3: 'ιες, ',
    4: 'θις ',
    5: 'ωνε—',
    6: 'ις ',
    7: 'ἑρε.'
}
parts_of_speech = {
    1: 'demonstrative',
    2: 'noun',
    3: 'exclamation',
    4: 'demonstrative',
    5: 'noun',
    6: 'verb',
    7: 'adverb',
}
sentence_types = {
    8: 'normal',
    9: 'interrupting',
}

At this point, the possibilities are endless, and hopefully the benefits of working this way are clear.
If you're a native Greek speaker, perhaps you'd like to use the Greek transcription; in that case,
you don't even need to load the normal text. Or, if you're not interested in punctuation, you
can load the text without it. If this were an XML dataset, we'd have to iterate
through every single tag and piece of associated data to filter out what we want. With the Text-Fabric
model, we only do that once: when we construct the text. From then on, the data is ready to go.
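To make the idea concrete, here is a small sketch of how the text of any object can be rebuilt from the mock dictionaries. It is plain Python under the assumptions of this toy model; the function name `node_text` is hypothetical and is not a Text-Fabric call.

```python
plain_text_no_punctuation = {
    1: 'This', 2: 'sentence', 3: 'yes', 4: 'this',
    5: 'one', 6: 'is', 7: 'here',
}
object_to_slots = {
    8: (1, 2, 6, 7),
    9: (3, 4, 5),
}

def node_text(node, text_feature):
    """Join the text of every slot that belongs to a node, in slot order."""
    # a slot is not in object_to_slots: it simply contains itself
    slots = object_to_slots.get(node, (node,))
    return ' '.join(text_feature[slot] for slot in sorted(slots))

print(node_text(8, plain_text_no_punctuation))
print(node_text(9, plain_text_no_punctuation))
print(node_text(3, plain_text_no_punctuation))
```

Swapping in a different feature dictionary (for example the Greek transcription) changes the output without touching the sentence definitions, which is the separation of concerns described earlier.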

# Text-Fabric Data Model 

The presentation above shows the core operating principle behind Text-Fabric,
though some of the terms are different:

| | Basic Components |
|------|------------------------------------------------------------------------------------|
| slot | sequential, atomic item used as a reference point for building the text (word or letter, etc.) |
| node | A unique "ID" number that corresponds to either a slot, or an item defined by a range of slots. |
| edge | a relationship between two nodes                                                   |
| feature | data associated with a given node |

With these components, we can model and represent a whole text. Text-Fabric handles all of the administration
for us by writing files which contain all of the relevant mappings and information. To build a Text-Fabric
corpus, we just have to feed the Text-Fabric Python module the right set of dictionaries.
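Before handing such dictionaries to any converter, it can help to check that they are internally consistent. The sketch below is one possible sanity check in plain Python. The function `check_model` and its rules are assumptions made for illustration; they are not part of Text-Fabric and do not reproduce its own validation.

```python
def check_model(n_slots, object_to_slots, object_types):
    """Return a list of problems found in a slot/node model (empty if none)."""
    problems = []
    slots = set(range(1, n_slots + 1))
    for node, members in sorted(object_to_slots.items()):
        if node in slots:
            problems.append(f'node {node} collides with a slot number')
        if not members:
            problems.append(f'node {node} has no slots')
        stray = set(members) - slots
        if stray:
            problems.append(f'node {node} refers to unknown slots {sorted(stray)}')
    # every slot and every other node needs an object type
    for node in sorted(slots | set(object_to_slots)):
        if node not in object_types:
            problems.append(f'node {node} has no object type')
    return problems


object_to_slots = {8: (1, 2, 6, 7), 9: (3, 4, 5)}
object_types = {n: 'word' for n in range(1, 8)}
object_types.update({8: 'sentence', 9: 'sentence'})

print(check_model(7, object_to_slots, object_types))
```

An empty list means the mock model from the previous section satisfies these basic rules.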

## Getting Ready-made Corpora with Text-Fabric
You can use Text-Fabric with a TF dataset stored locally on your machine. But Text-Fabric
can also retrieve ready-made corpora stored in its public GitHub
repository. Here is the set of corpora currently stored in the repository:

| acronym       | language/writing system | name                             | period           | description                                                       | converted by                         |
|---------------|-------------------------|----------------------------------|------------------|-------------------------------------------------------------------|--------------------------------------|
| athenaeus     | Greek                   | Works of Athenaeus               | 80 - 170         | Deipnosophistae                                                   | Ernst Boogert                        |
| banks         | modern english          | Iain M. Banks                    | 1984 - 1987      | 99 words from the SF novel Consider Phlebas                       | Dirk Roorda                          |
| bhsa          | Hebrew                  | Hebrew Bible                     | 1000 BC - 900 AD | Biblia Hebraica Stuttgartensia (Amstelodamensis)                  | Dirk Roorda + ETCBC                  |
| dss           | Hebrew                  | Dead Sea Scrolls                 | 300 BC - 100 AD  | Transcriptions with morphology based on Martin Abegg's data files | Dirk Roorda, Jarod Jacobs            |
| nena          | Aramaic                 | North Eastern Neo-Aramaic Corpus | 2000-on          | Nena Cambridge                                                    | Cody Kingham                         |
| oldbabylonian | Akkadian / cuneiform    | Old Babylonian letters           | 1900 - 1600 BC   | Altbabylonische Briefe in Umschrift und Übersetzung               | Dirk Roorda, Cale Johnson            |
| peshitta      | Syriac                  | Syriac Old Testament             | 1000 BC - 900 AD | Vetus Testamentum Syriace                                         | Dirk Roorda, Hannes Vlaardingerbroek |
| quran         | Arabic                  | Quran                            | 600 - 900        | Quranic Arabic Corpus                                             | Dirk Roorda, Cornelis van Lit        |
| syrnt         | Syriac                  | Syriac New Testament             | 0 - 1000         | Novum Testamentum Syriace                                         | Dirk Roorda, Hannes Vlaardingerbroek |
| tisch         | Greek                   | New Testament                    | 50 - 450         | Greek New Testament in Tischendorf 8th Edition                    | Cody Kingham                         |
| uruk          | proto-cuneiform         | Uruk                             | 4000 - 3100 BC   | Archaic tablets from Uruk                                         | Dirk Roorda, Cale Johnson            |

See the links to all the corpora [here](https://annotation.github.io/text-fabric/About/Corpora/). 
Most of the ready-made corpora are currently ancient languages, but the Text-Fabric library is 
always expanding.

We will load Text-Fabric now and begin working with the Syriac New Testament.

In [7]:
# If you don't have Text-Fabric installed,
# uncomment below and run, and it should work

#! pip install text-fabric

In [8]:
from tf.app import use

syrnt = use('syrnt') # load the corpus using the ready-made method

rate limit is 60 requests per hour, with 58 left for this hour


To increase the rate,see https://annotation.github.io/text-fabric/Api/Repo/


	connecting to online GitHub repo annotation/app-syrnt ... connected
	code/__init__.py...downloaded
	code/app.py...downloaded
	code/config.py...downloaded
	code/static...directory
		code/static/display.css...downloaded
		code/static/logo.png...downloaded
	OK
Using TF-app in /Users/cody/text-fabric-data/annotation/app-syrnt/code:
	rv0.2=#677bff6df32d1052d521534a25aa7eb5ff8cacfd (latest release)
rate limit is 60 requests per hour, with 41 left for this hour


To increase the rate,see https://annotation.github.io/text-fabric/Api/Repo/


	connecting to online GitHub repo etcbc/syrnt ... connected
Using data in /Users/cody/text-fabric-data/etcbc/syrnt/tf/0.1:
	r0.3=#a997b71cacc2c8b342c78a1039339233d55510fa (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


## `tf` Data Location

Text-Fabric has now downloaded the corpus to your machine and loaded it into memory.

**Note the links above lead to documentation about the features particular to this corpus.**
Text-Fabric does not prescribe any particular features; features are defined freely for
each corpus. This documentation is therefore important for knowing how to interact with that corpus.

Let's find out where this data is loaded.

In [16]:
print(syrnt.mLocations)

['/Users/cody/text-fabric-data/etcbc/syrnt/tf']


We can also see the version of the dataset we've loaded below.

In [24]:
syrnt.version

'0.1'

**The file path above + the version tells you where the Text-Fabric data files
have been downloaded.**

In [28]:
path_to_versions = syrnt.mLocations[0]
version = '/' + syrnt.version
path_to_data = path_to_versions + version

print(path_to_data)

/Users/cody/text-fabric-data/etcbc/syrnt/tf/0.1


## TF Data Files

Let's have a look at the files that have been 
downloaded. 

We'll use a set of terminal commands to peek into that 
folder. *Don't worry if you don't understand the terminal commands.*

You could also navigate manually to the location indicated by the file path
and have a look yourself.

In [29]:
! ls $path_to_data

__checkout__.txt ntyp.tf          root.tf          stem_etcbc.tf
book.tf          nu.tf            root_etcbc.tf    stem_sedra.tf
book@en.tf       oslots.tf        root_sedra.tf    suffix.tf
chapter.tf       otext.tf         seyame.tf        suffix_etcbc.tf
demcat.tf        otype.tf         sfcontract.tf    suffix_sedra.tf
fmhdot.tf        prefix.tf        sfgn.tf          verse.tf
gn.tf            prefix_etcbc.tf  sfnu.tf          vs.tf
lexeme.tf        prefix_sedra.tf  sfps.tf          vt.tf
lexeme_etcbc.tf  prtyp.tf         sp.tf            word.tf
lexeme_sedra.tf  ps.tf            st.tf            word_etcbc.tf
nmtyp.tf         ptctyp.tf        stem.tf          word_sedra.tf


### otype ("object type")

We see a bunch of files with the `tf` extension. These are all files 
which contain Text-Fabric formatted data. Let's have a look at the file
called `otype`. 

In [40]:
! cat $path_to_data/otype.tf

@node
@dataset=syrnt
@datasetName=Syriac New Testament
@description=?
@email1=dirk.roorda@dans.knaw.nl
@encoders=Ancient Biblical Manuscript Center  (transcription),George A. Kiraz and James W. Bennett (database)and Hannes Vlaardingerbroek and Dirk Roorda (TF)
@source=SEDRA
@sourceUrl=https://sedra.bethmardutho.org/about/contributors
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-10-18T09:38:33Z

1-109640	word
109641-109667	book
109668-109927	chapter
109928-112965	lexeme
112966-120922	verse


This is the entire contents of the file. The top of a TF file contains 
metadata tags, marked with `@` symbols, followed by the name of the field,
and the metadata.

Below the metadata we can see the actual data itself. The first column
contains a range of node ID numbers, and the second column contains a 
string associated with those numbers, telling us—in this case—the "otype"
or object type of each of these nodes.
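The data lines of `otype.tf` are simple enough to read by hand. As a rough sketch (plain Python, independent of the Text-Fabric library, with the hypothetical helper name `parse_otype`), the five lines shown above can be turned into node ranges and counts like this:

```python
# the data lines of otype.tf as shown above (tab-separated)
otype_lines = [
    '1-109640\tword',
    '109641-109667\tbook',
    '109668-109927\tchapter',
    '109928-112965\tlexeme',
    '112966-120922\tverse',
]

def parse_otype(lines):
    """Map each object type to its (first_node, last_node) range."""
    ranges = {}
    for line in lines:
        node_range, otype = line.split('\t')
        first, _, last = node_range.partition('-')
        # a single node would be written without a dash
        ranges[otype] = (int(first), int(last or first))
    return ranges

ranges = parse_otype(otype_lines)
counts = {otype: last - first + 1 for otype, (first, last) in ranges.items()}
print(counts)
print(sum(counts.values()))
```

The counts add up to the highest node number in the file, since node numbers are consecutive and every node has exactly one object type.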

**Can you guess which of these object types serves as the slots?**

### oslots ("object slots")

Above in our mock dataset we mapped tuples of slots to sentence
IDs. The `oslots` file does something similar to this, mapping a 
given node number to a range of corresponding slots.

In [88]:
! head -20 $path_to_data/oslots.tf; echo "\n-- first 20 lines of file --"

@edge
@dataset=syrnt
@datasetName=Syriac New Testament
@description=?
@email1=dirk.roorda@dans.knaw.nl
@encoders=Ancient Biblical Manuscript Center  (transcription),George A. Kiraz and James W. Bennett (database)and Hannes Vlaardingerbroek and Dirk Roorda (TF)
@source=SEDRA
@sourceUrl=https://sedra.bethmardutho.org/about/contributors
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-10-18T09:38:38Z

109641	1-13979
13980-22772
22773-38006
38007-50414
50415-65797
65798-71565
71566-77551
77552-81288

-- first 20 lines of file --


Note that the data begins with two columns, but then switches to a single column.
**That is because the Text-Fabric data format optimizes storage by *inferring* the node number
when one node directly follows the previous one.** 

For example, the first line above, 

```
109641	1-13979
``` 

defines the range of slots belonging to node `109641`. The second line:

```
13980-22772
```

defines the range of slots belonging to `109642`, which is inferred from
the fact that it immediately follows `109641`. If there is an interruption from one
node to the next (i.e. they are not consecutive), then the node column will 
briefly reappear to update the position. 
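The inference rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Text-Fabric's own reader; the function names are hypothetical, and the handling of comma-separated slot ranges (for discontinuous nodes) is an assumption about the format that does not appear in the excerpt above.

```python
oslots_lines = [
    '109641\t1-13979',
    '13980-22772',
    '22773-38006',
]

def parse_slot_spec(spec):
    """Expand a spec such as '1-3,7' into a list of (first, last) ranges."""
    ranges = []
    for part in spec.split(','):  # assumption: commas separate ranges
        first, _, last = part.partition('-')
        ranges.append((int(first), int(last or first)))
    return ranges

def parse_oslots(lines):
    """Read oslots data lines, inferring omitted node numbers."""
    node = 0
    result = {}
    for line in lines:
        if '\t' in line:
            # explicit node number: reset the position
            node_str, spec = line.split('\t')
            node = int(node_str)
        else:
            # no node number: it is the previous node + 1
            spec = line
            node += 1
        result[node] = parse_slot_spec(spec)
    return result

for node, slot_ranges in parse_oslots(oslots_lines).items():
    print(node, slot_ranges)
```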

Note also that we can look up the object type of `109641` in the `otype` file
referred to further above. It is a book node.

#### feature files

Let's have a look at a different file, `word.tf`, which according to the 
[documentation](https://github.com/etcbc/syrnt/blob/master/docs/transcription-0.1.md#node-type-word)
contains the plain text of each word.

In [89]:
! head -25 $path_to_data/word.tf; echo "\n-- first 25 lines of file --"

@node
@dataset=syrnt
@datasetName=Syriac New Testament
@description=full form of the word in syriac script
@email1=dirk.roorda@dans.knaw.nl
@encoders=Ancient Biblical Manuscript Center  (transcription),George A. Kiraz and James W. Bennett (database)and Hannes Vlaardingerbroek and Dirk Roorda (TF)
@source=SEDRA
@sourceUrl=https://sedra.bethmardutho.org/about/contributors
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-10-18T09:38:37Z

ܟܬܒܐ
ܕܝܠܝܕܘܬܗ
ܕܝܫܘܥ
ܡܫܝܚܐ
ܒܪܗ
ܕܕܘܝܕ
ܒܪܗ
ܕܐܒܪܗܡ
ܐܒܪܗܡ
ܐܘܠܕ
ܠܐܝܣܚܩ
ܐܝܣܚܩ
ܐܘܠܕ

-- first 25 lines of file --


The file contains the plain text features corresponding to word nodes.

If the first item in the file is associated with node 1, there is no need
to write "1". Text-Fabric can simply assume the count.

# Interacting with a TF corpus

Text-Fabric is not just a data model, but also a Python library that interacts with
that model in efficient ways. The library supplies a set of Python objects that can
be used to iterate over nodes, select features on those nodes, or compare relations
between the nodes. 

The Text-Fabric objects are made available through the `api`.

#### [Read about the Text-Fabric API here](https://annotation.github.io/text-fabric/Api/Fabric/#text-fabric-api)

Here are the basic objects for interacting with a loaded Text-Fabric corpus:

| object | what it does                          |
|--------|---------------------------------------|
| N      | gives access to nodes                 |
| F      | gives access to node features         |
| L      | retrieves embedding or embedded nodes |
| T      | retrieves text and section markers    |
| E      | retrieves edge data from a node       |

Read about [other objects here](https://annotation.github.io/text-fabric/Api/Fabric/#loading).

In [57]:
dir(syrnt.api)

['AllComputeds',
 'AllEdges',
 'AllFeatures',
 'C',
 'Call',
 'Computed',
 'ComputedString',
 'Cs',
 'E',
 'Eall',
 'Edge',
 'EdgeString',
 'Es',
 'F',
 'Fall',
 'Feature',
 'FeatureString',
 'Fs',
 'L',
 'Locality',
 'N',
 'Nodes',
 'S',
 'Search',
 'T',
 'TF',
 'Text',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'cache',
 'ensureLoaded',
 'error',
 'ignored',
 'indent',
 'info',
 'isSilent',
 'loadLog',
 'makeAvailableIn',
 'otypeRank',
 'reset',
 'setSilent',
 'silentOff',
 'silentOn',
 'sortKey',
 'sortKeyTuple',
 'sortNodes',

## Access all nodes

Let's start off with simple node interactions.

In [58]:
N = syrnt.api.N

In [60]:
number_of_nodes = 0

for node in N():
    number_of_nodes += 1
    
print(number_of_nodes)

120922


We've now iterated through all of the nodes in the corpus and counted them.

## Access features of nodes

If we want to know the `otype` (object type) of each node, we need the 
`F` (feature) object.

In [61]:
F = syrnt.api.F

### [See F documentation](https://annotation.github.io/text-fabric/Api/Features/)

We can get a feature value with the following pattern:

```

F.feature_name.v(node)

```

Remember that `1` will be the first slot in the corpus. Let's call the 
feature `otype` on this slot.

In [62]:
F.otype.v(1)

'word'

Now let's apply this method to our count and count object types.

In [65]:
number_of_nodes = 0
otype_counts = {}

for node in N():
    
    # count the node and get the type
    number_of_nodes += 1
    otype = F.otype.v(node)
    
    # count the otype in the dictionary
    if otype in otype_counts:
        otype_counts[otype] += 1
    else:
        otype_counts[otype] = 1
    
print(number_of_nodes)

print()
for otype,count in otype_counts.items():
    print(otype, 'count is', count)

120922

book count is 27
chapter count is 260
verse count is 7957
lexeme count is 3038
word count is 109640
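The same tally can be written more compactly with `collections.Counter` from the standard library. The sketch below uses a mock dictionary in place of the loaded corpus so that it runs on its own; `count_otypes` is a hypothetical helper, and with a loaded corpus one would presumably pass the node iterator and `F.otype.v` used in the cell above instead of the mock.

```python
from collections import Counter

def count_otypes(nodes, otype_of):
    """Tally object types for an iterable of nodes."""
    return Counter(otype_of(node) for node in nodes)

# mock stand-ins for the node iterator and F.otype.v,
# reusing the toy sentence model from earlier
mock_otypes = {n: 'word' for n in range(1, 8)}
mock_otypes.update({8: 'sentence', 9: 'sentence'})

counts = count_otypes(mock_otypes, mock_otypes.get)
print(sum(counts.values()))
print(counts.most_common())
```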


## Get a generator for specific feature values

What if we're only interested in, say, the word nodes? 

We can use the following syntax to get a generator object (which we can
convert to a list or simply loop over) that yields only the nodes with a given feature value.

```
F.feature_name.s('feature_value')
```

In [66]:
word_counts = 0

for word in F.otype.s('word'):
    word_counts += 1
    
print(word_counts)

109640


We could also do this for other features. Below we only iterate over the nodes
that have a part of speech feature (sp) of `verb`.

In [73]:
verb_count = 0

for word in F.sp.s('verb'):
    verb_count += 1

print(verb_count)

30441


## Access embedders or embedded nodes

As we saw above in the `oslots` file, the first book node in the corpus
contains a large range of slots:

```
109641    1-13979
```

But there are also nodes contained within the book that contain smaller
ranges of slots, such as chapters.

For instance, the first chapter in this corpus contains slots 1-290. In 
other words, according to the slots, the chapter is embedded in the book.

**If a node's range of slots intersects with, and is smaller than, another
node's, it is embedded in that node; and if it intersects and is bigger than
another node's, it embeds that node**.

For a more expansive definition, see the documentation.
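As a rough illustration of the principle, the sketch below decides embedding purely by comparing sets of slots. It simplifies the definition to "all slots of one node are contained in the other" and is not how Text-Fabric computes its results; the function names and the paragraph node `10` are hypothetical additions to the toy model.

```python
node_slots = {
    8: {1, 2, 6, 7},            # sentence1
    9: {3, 4, 5},               # sentence2
    10: {1, 2, 3, 4, 5, 6, 7},  # hypothetical paragraph node
}
# a slot contains exactly itself
node_slots.update({slot: {slot} for slot in range(1, 8)})

def embedders(node):
    """Nodes whose slots include every slot of `node` (rough analogue of going up)."""
    own = node_slots[node]
    return sorted(m for m, s in node_slots.items() if m != node and own <= s)

def embedded(node):
    """Nodes whose slots all fall inside `node` (rough analogue of going down)."""
    own = node_slots[node]
    return sorted(m for m, s in node_slots.items() if m != node and s <= own)

print(embedders(3))
print(embedded(8))
print(embedders(8))
```

Note that slot `3` lies between the slots of `sentence1` but is not embedded in it, because embedding is decided by slot membership rather than by position. Two nodes with identical slot sets would embed each other under this simplified rule, a case the real library has to handle more carefully.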

We can access embedding and embedded nodes with Text-Fabric's `L` object, using the following syntax:

#### [See L documentation](https://annotation.github.io/text-fabric/Api/Locality/#locality)

```
L.u(node, "embedder_otype")
```
or
```
L.d(node, "embedded_otype")
```

Some examples:

In [91]:
L = syrnt.api.L

In [94]:
first_book_chapters = L.d(109641, 'chapter')

print(first_book_chapters)
print()
print('first book contains', len(first_book_chapters), 'chapters')

(109668, 109669, 109670, 109671, 109672, 109673, 109674, 109675, 109676, 109677, 109678, 109679, 109680, 109681, 109682, 109683, 109684, 109685, 109686, 109687, 109688, 109689, 109690, 109691, 109692, 109693, 109694, 109695)

first book contains 28 chapters


Let's retrieve the first chapter, starting with a slot this time.

In [95]:
first_chapter = L.u(1, 'chapter')

first_chapter

(109668,)

Note that even when an item is embedded by only 1 node, `L` still
returns a tuple. So we need to index the tuple to retrieve the node number
of the first chapter.

In [96]:
first_chapter_node = first_chapter[0]

And we can go back down to the words, confirming that slot `1` is contained in this chapter:

In [98]:
print(L.d(first_chapter_node, 'word'))

(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 22

## Get text for a node

The `T` object provides a convenient way to quickly access the surface text 
of any given node in the dataset.

### [See the T documentation](https://annotation.github.io/text-fabric/Api/Text/)

```
T.text(node, fmt="format_here")
```

Note that `fmt` is an optional argument, and the string values it accepts are
specific to the particular corpus you're working with. It allows you to specify alternative
representations of the text, such as transcriptions. You can omit it if you
only want the default representation.

In [101]:
T = syrnt.api.T

In [104]:
print(T.text(1))

ܟܬܒܐ 


In [105]:
first_verse = L.u(1, 'verse')[0]

print(T.text(first_verse))

ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ 


In [106]:
print(T.text(first_chapter_node))

ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ ܝܗܘܕܐ ܐܘܠܕ ܠܦܪܨ ܘܠܙܪܚ ܡܢ ܬܡܪ ܦܪܨ ܐܘܠܕ ܠܚܨܪܘܢ ܚܨܪܘܢ ܐܘܠܕ ܠܐܪܡ ܐܪܡ ܐܘܠܕ ܠܥܡܝܢܕܒ ܥܡܝܢܕܒ ܐܘܠܕ ܠܢܚܫܘܢ ܢܚܫܘܢ ܐܘܠܕ ܠܣܠܡܘܢ ܣܠܡܘܢ ܐܘܠܕ ܠܒܥܙ ܡܢ ܪܚܒ ܒܥܙ ܐܘܠܕ ܠܥܘܒܝܕ ܡܢ ܪܥܘܬ ܥܘܒܝܕ ܐܘܠܕ ܠܐܝܫܝ ܐܝܫܝ ܐܘܠܕ ܠܕܘܝܕ ܡܠܟܐ ܕܘܝܕ ܐܘܠܕ ܠܫܠܝܡܘܢ ܡܢ ܐܢܬܬܗ ܕܐܘܪܝܐ ܫܠܝܡܘܢ ܐܘܠܕ ܠܪܚܒܥܡ ܪܚܒܥܡ ܐܘܠܕ ܠܐܒܝܐ ܐܒܝܐ ܐܘܠܕ ܠܐܣܐ ܐܣܐ ܐܘܠܕ ܠܝܗܘܫܦܛ ܝܗܘܫܦܛ ܐܘܠܕ ܠܝܘܪܡ ܝܘܪܡ ܐܘܠܕ ܠܥܘܙܝܐ ܥܘܙܝܐ ܐܘܠܕ ܠܝܘܬܡ ܝܘܬܡ ܐܘܠܕ ܠܐܚܙ ܐܚܙ ܐܘܠܕ ܠܚܙܩܝܐ ܚܙܩܝܐ ܐܘܠܕ ܠܡܢܫܐ ܡܢܫܐ ܐܘܠܕ ܠܐܡܘܢ ܐܡܘܢ ܐܘܠܕ ܠܝܘܫܝܐ ܝܘܫܝܐ ܐܘܠܕ ܠܝܘܟܢܝܐ ܘܠܐܚܘܗܝ ܒܓܠܘܬܐ ܕܒܒܠ ܡܢ ܒܬܪ ܓܠܘܬܐ ܕܝܢ ܕܒܒܠ ܝܘܟܢܝܐ ܐܘܠܕ ܠܫܠܬܐܝܠ ܫܠܬܐܝܠ ܐܘܠܕ ܠܙܘܪܒܒܠ ܙܘܪܒܒܠ ܐܘܠܕ ܠܐܒܝܘܕ ܐܒܝܘܕ ܐܘܠܕ ܠܐܠܝܩܝܡ ܐܠܝܩܝܡ ܐܘܠܕ ܠܥܙܘܪ ܥܙܘܪ ܐܘܠܕ ܠܙܕܘܩ ܙܕܘܩ ܐܘܠܕ ܠܐܟܝܢ ܐܟܝܢ ܐܘܠܕ ܠܐܠܝܘܕ ܐܠܝܘܕ ܐܘܠܕ ܠܐܠܝܥܙܪ ܐܠܝܥܙܪ ܐܘܠܕ ܠܡܬܢ ܡܬܢ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܘܣܦ ܓܒܪܗ ܕܡܪܝܡ ܕܡܢܗ ܐܬܝܠܕ ܝܫܘܥ ܕܡܬܩܪܐ ܡܫܝܚܐ ܟܠܗܝܢ ܗܟܝܠ ܫܪܒܬܐ ܡܢ ܐܒܪܗܡ ܥܕܡܐ ܠܕܘܝܕ ܫܪܒܬܐ ܐܪܒܥܣܪܐ ܘܡܢ ܕܘܝܕ ܥܕܡܐ ܠܓܠܘܬܐ ܕܒܒܠ ܫܪܒܬܐ ܐܪܒܥܣܪܐ ܘܡܢ ܓܠܘܬܐ ܕܒܒܠ ܥܕܡܐ ܠܡܫܝܚܐ ܫܪܒܬܐ ܐܪܒܥܣܪܐ ܝܠܕܗ ܕܝܢ 

Additional formats are defined in the `otext` file with a series of metadata statements:

In [99]:
! cat $path_to_data/otext.tf

@config
@dataset=syrnt
@datasetName=Syriac New Testament
@email1=dirk.roorda@dans.knaw.nl
@encoders=Ancient Biblical Manuscript Center  (transcription),George A. Kiraz and James W. Bennett (database)and Hannes Vlaardingerbroek and Dirk Roorda (TF)
@fmt:lex-orig-full={lexeme} 
@fmt:lex-trans-full={lexeme_etcbc} 
@fmt:text-orig-full={word} 
@fmt:text-trans-full={word_etcbc} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
@source=SEDRA
@sourceUrl=https://sedra.bethmardutho.org/about/contributors
@writtenBy=Text-Fabric
@dateWritten=2018-10-18T09:38:38Z



The relevant lines here are those beginning with `@fmt`. Note the names `lex-orig-full` and `lex-trans-full`. 
The subsequent values `{lexeme}` and `{lexeme_etcbc}` tell which node features are used to compile
the text format. 
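One way to picture what such a format does is to fill the template once per slot and concatenate the results. The sketch below imitates that with Python's `str.format` on mock data. It is only a stand-in: the real template syntax is described in the documentation linked below and may support more than plain `{feature}` placeholders, and the feature `word_upper` and the function `render` are hypothetical.

```python
features = {
    'word': {1: 'This', 2: 'sentence', 3: 'yes'},
    'word_upper': {1: 'THIS', 2: 'SENTENCE', 3: 'YES'},  # hypothetical transcription
}
formats = {
    'text-orig-full': '{word} ',
    'text-trans-full': '{word_upper} ',
}

def render(slots, fmt):
    """Fill the format template once per slot and concatenate the results."""
    template = formats[fmt]
    pieces = []
    for slot in slots:
        # look up every feature value for this slot; missing values become ''
        values = {name: table.get(slot, '') for name, table in features.items()}
        pieces.append(template.format(**values))
    return ''.join(pieces)

print(repr(render([1, 2, 3], 'text-orig-full')))
print(repr(render([1, 2, 3], 'text-trans-full')))
```

The trailing space in each template explains why the rendered text ends with a space, as in the output of `T.text` above.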

### [You can read about how a fmt string should be written here](https://annotation.github.io/text-fabric/Api/Text/#text-representation)

Let's have a look at the two alternative formats.

In [107]:
T.text(first_chapter_node, fmt='lex-orig-full')

'ܟܬܒܐ ܝܠܝܕܘܬܐ ܝܫܘܥ ܡܫܝܚܐ ܒܪܐ ܕܘܝܕ ܒܪܐ ܐܒܪܗܡ ܐܒܪܗܡ ܝܠܕ ܐܝܣܚܩ ܐܝܣܚܩ ܝܠܕ ܝܥܩܘܒ ܝܥܩܘܒ ܝܠܕ ܝܗܘܕܐ ܐܚܐ ܝܗܘܕܐ ܝܠܕ ܦܪܨ ܙܪܚ ܡܢ ܬܡܪ ܦܪܨ ܝܠܕ ܚܨܪܘܢ ܚܨܪܘܢ ܝܠܕ ܐܪܡ ܐܪܡ ܝܠܕ ܥܡܝܢܕܒ ܥܡܝܢܕܒ ܝܠܕ ܢܚܫܘܢ ܢܚܫܘܢ ܝܠܕ ܣܠܡܘܢ ܣܠܡܘܢ ܝܠܕ ܒܥܙ ܡܢ ܪܚܒ ܒܥܙ ܝܠܕ ܥܘܒܝܕ ܡܢ ܪܥܘܬ ܥܘܒܝܕ ܝܠܕ ܐܝܫܝ ܐܝܫܝ ܝܠܕ ܕܘܝܕ ܡܠܟܐ ܕܘܝܕ ܝܠܕ ܫܠܝܡܘܢ ܡܢ ܐܢܬܬܐ ܐܘܪܝܐ ܫܠܝܡܘܢ ܝܠܕ ܪܚܒܥܡ ܪܚܒܥܡ ܝܠܕ ܐܒܝܐ ܐܒܝܐ ܝܠܕ ܐܣܐ ܐܣܐ ܝܠܕ ܝܗܘܫܦܛ ܝܗܘܫܦܛ ܝܠܕ ܝܘܪܡ ܝܘܪܡ ܝܠܕ ܥܘܙܝܐ ܥܘܙܝܐ ܝܠܕ ܝܘܬܡ ܝܘܬܡ ܝܠܕ ܐܚܙ ܐܚܙ ܝܠܕ ܚܙܩܝܐ ܚܙܩܝܐ ܝܠܕ ܡܢܫܐ ܡܢܫܐ ܝܠܕ ܐܡܘܢ ܐܡܘܢ ܝܠܕ ܝܘܫܝܐ ܝܘܫܝܐ ܝܠܕ ܝܘܟܢܝܐ ܐܚܐ ܓܠܘܬܐ ܒܒܠ ܡܢ ܒܬܪ ܓܠܘܬܐ ܕܝܢ ܒܒܠ ܝܘܟܢܝܐ ܝܠܕ ܫܠܬܐܝܠ ܫܠܬܐܝܠ ܝܠܕ ܙܘܪܒܒܠ ܙܘܪܒܒܠ ܝܠܕ ܐܒܝܘܕ ܐܒܝܘܕ ܝܠܕ ܐܠܝܩܝܡ ܐܠܝܩܝܡ ܝܠܕ ܥܙܘܪ ܥܙܘܪ ܝܠܕ ܙܕܘܩ ܙܕܘܩ ܝܠܕ ܐܟܝܢ ܐܟܝܢ ܝܠܕ ܐܠܝܘܕ ܐܠܝܘܕ ܝܠܕ ܐܠܝܥܙܪ ܐܠܝܥܙܪ ܝܠܕ ܡܬܢ ܡܬܢ ܝܠܕ ܝܥܩܘܒ ܝܥܩܘܒ ܝܠܕ ܝܘܣܦ ܓܒܪܐ ܡܪܝܡ ܡܢ ܝܠܕ ܝܫܘܥ ܩܪܐ ܡܫܝܚܐ ܟܠ ܗܟܝܠ ܫܪܒܬܐ ܡܢ ܐܒܪܗܡ ܥܕܡܐ ܕܘܝܕ ܫܪܒܬܐ ܐܪܒܥܣܪ ܡܢ ܕܘܝܕ ܥܕܡܐ ܓܠܘܬܐ ܒܒܠ ܫܪܒܬܐ ܐܪܒܥܣܪ ܡܢ ܓܠܘܬܐ ܒܒܠ ܥܕܡܐ ܡܫܝܚܐ ܫܪܒܬܐ ܐܪܒܥܣܪ ܝܠܕܐ ܕܝܢ ܝܫܘܥ ܡܫܝܚܐ ܗܟܢܐ ܗܘܐ ܟܕ ܡܟܪ ܗܘܐ ܡܪܝܡ ܐܡܐ ܝܘܣܦ ܥܕܠܐ ܫܘܬܦ ܫܟܚ ܒܛܢܬܐ ܡܢ ܪܘܚܐ ܩܘܕܫܐ ܝܘܣܦ ܕܝܢ ܒܥܠܐ ܟܐܢܐ ܗܘܐ ܠܐ ܨܒܐ ܦܪܣܝ ܪܥ

Look closely and you'll notice this text is different from the one above, written now using only lexical forms.

In [108]:
T.text(first_chapter_node, fmt='lex-trans-full')

'KTB> JLJDWT> JCW< MCJX> BR> DWJD BR> >BRHM >BRHM JLD >JSXQ >JSXQ JLD J<QWB J<QWB JLD JHWD> >X> JHWD> JLD PRY ZRX MN TMR PRY JLD XYRWN XYRWN JLD >RM >RM JLD <MJNDB <MJNDB JLD NXCWN NXCWN JLD SLMWN SLMWN JLD B<Z MN RXB B<Z JLD <WBJD MN R<WT <WBJD JLD >JCJ >JCJ JLD DWJD MLK> DWJD JLD CLJMWN MN >NTT> >WRJ> CLJMWN JLD RXB<M RXB<M JLD >BJ> >BJ> JLD >S> >S> JLD JHWCPV JHWCPV JLD JWRM JWRM JLD <WZJ> <WZJ> JLD JWTM JWTM JLD >XZ >XZ JLD XZQJ> XZQJ> JLD MNC> MNC> JLD >MWN >MWN JLD JWCJ> JWCJ> JLD JWKNJ> >X> GLWT> BBL MN BTR GLWT> DJN BBL JWKNJ> JLD CLT>JL CLT>JL JLD ZWRBBL ZWRBBL JLD >BJWD >BJWD JLD >LJQJM >LJQJM JLD <ZWR <ZWR JLD ZDWQ ZDWQ JLD >KJN >KJN JLD >LJWD >LJWD JLD >LJ<ZR >LJ<ZR JLD MTN MTN JLD J<QWB J<QWB JLD JWSP GBR> MRJM MN JLD JCW< QR> MCJX> KL HKJL CRBT> MN >BRHM <DM> DWJD CRBT> >RB<SR MN DWJD <DM> GLWT> BBL CRBT> >RB<SR MN GLWT> BBL <DM> MCJX> CRBT> >RB<SR JLD> DJN JCW< MCJX> HKN> HW> KD MKR HW> MRJM >M> JWSP <DL> CWTP CKX BVNT> MN RWX> QWDC> JWSP DJN B<L> K>N> HW> L> YB> PRSJ R<

This is an ASCII transcription of the text.

## Get Section Data with T

T can also be used to go back and forth from various section data to
various nodes. 


### To go from a section to a node

```
T.nodeFromSection((section1, section1.1, section1.2))
```

Let's say we are interested in a given book, in this case the book 
of "Hebrews". We can select the book simply with `T.nodeFromSection`
by feeding it a tuple.

In [111]:
hebrews = T.nodeFromSection(('Hebrews',))

print(hebrews)

109659


We can further specify chapter and verse like this:


In [113]:
hebrews_1_1 = T.nodeFromSection(('Hebrews', 1, 1))

print(hebrews_1_1)

119785



### To go from a node to a section

```
T.sectionFromNode(node)
```

If we want the section data from a node instead, we can use this function.

In [114]:
random_slot = 21342

print(T.sectionFromNode(random_slot))

('Mark', 14, 22)


## Pretty representation of objects

Text-Fabric also can represent nodes with formatted HTML to facilitate 
the data exploration process. 

### [Read about pretty methods here](https://annotation.github.io/text-fabric/apidocs/html/tf/applib/display.html#tf.applib.display.plain)

In [134]:
pretty = syrnt.pretty
plain = syrnt.plain
prettyTuple = syrnt.prettyTuple

In [135]:
pretty(1)

In [136]:
pretty(first_verse)

In [139]:
prettyTuple((first_verse, 1), seq=0) # seq = result number

Note that with `prettyTuple`, we get highlighting behavior when an 
embedded node is included in the tuple.

# Writing Text-Fabric Queries

You can navigate the corpus programmatically with Python objects, as we've 
seen above. But you can also use the Text-Fabric query language, Search, which is useful
for quickly writing patterns that express relationships within the text.

Note that TF Search has its own syntax, which you can read about below.

### [Read about the syntax for TF Search](https://annotation.github.io/text-fabric/Use/Search/)

The syntax is relatively straightforward: we specify nodes or features of those 
nodes, and we can specify embedding relations by indenting one node underneath another.
We can also specify particular sequences between nodes embedded at the same level.

In [121]:
search = syrnt.search # NB: not stored under standard API

In [133]:
verb_query = search('word sp=verb') # query in the string

  0.17s 30441 results


In [123]:
type(verb_query)

list

In [124]:
verb_query[:10]

[(4,), (10,), (13,), (16,), (20,), (26,), (29,), (32,), (35,), (38,)]

Note that the results of the search are returned as a list of tuples of node numbers.

We can take these nodes and do normal Text-Fabric things with them, as shown
above.
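Because the results are ordinary Python tuples, they can be processed with ordinary Python. As a small sketch, the ten tuples printed above are copied into a list here so that the example runs without the corpus; the variable names are illustrative only.

```python
# the first ten results shown above
results = [(4,), (10,), (13,), (16,), (20,), (26,), (29,), (32,), (35,), (38,)]

# a query with a single line yields one-element tuples,
# so each tuple can be unpacked into a bare node number
verb_nodes = [node for (node,) in results]

# distance in slots between consecutive verbs
gaps = [b - a for a, b in zip(verb_nodes, verb_nodes[1:])]

print(verb_nodes)
print(gaps)
```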

We can also visualize the query results with the show method, which applies the
pretty methods to a list of tuples.

In [125]:
show = syrnt.show

In [129]:
show(verb_query[:3]) # show first 3

And we can also write more advanced queries:

In [131]:
# write the query in a large string
# indentation specifies embedding into verse
# <: specifies adjacent order between the two words
noun_verb_query = """

verse
    word sp=noun
    <: word sp=verb

"""

noun_verb_results = search(noun_verb_query)

  0.34s 8491 results
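To see what this query expresses, it may help to spell out the same pattern by hand on the toy sentence from the beginning of the chapter. The sketch below is plain Python and says nothing about how Search is implemented; the function `noun_verb_pairs` and the verse node `10` are hypothetical, and "adjacent" is taken to mean that the verb occupies the very next slot after the noun.

```python
parts_of_speech = {
    1: 'demonstrative', 2: 'noun', 3: 'exclamation', 4: 'demonstrative',
    5: 'noun', 6: 'verb', 7: 'adverb',
}
verses = {10: (1, 2, 3, 4, 5, 6, 7)}           # hypothetical verse node
sentences = {8: (1, 2, 6, 7), 9: (3, 4, 5)}    # sentences from the toy model

def noun_verb_pairs(containers, pos):
    """Find (container, noun, verb) where the verb is the very next slot."""
    results = []
    for container, slots in containers.items():
        members = set(slots)
        for slot in sorted(members):
            follower = slot + 1
            if (
                pos.get(slot) == 'noun'
                and pos.get(follower) == 'verb'
                and follower in members  # both words must be embedded in the container
            ):
                results.append((container, slot, follower))
    return results

print(noun_verb_pairs(verses, parts_of_speech))
print(noun_verb_pairs(sentences, parts_of_speech))
```

Each result tuple holds one node per line of the query, in the same order. The second call finds nothing because the noun in slot 5 and the verb in slot 6 belong to different sentences, which shows how the embedding condition constrains the matches.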


Note that we can also use the optional argument `end` to restrict the `show` method to a certain 
number of results.

In [132]:
show(noun_verb_results, end=5)

## Building a Text-Fabric Corpus

We have been working with a ready-made Text-Fabric Corpus. But you can also
build your own with whatever text you already have. Text-Fabric also has 
a class for compiling and saving Text-Fabric files. 

### [See the tutorial to make your own corpus](https://nbviewer.jupyter.org/github/annotation/banks/blob/master/programs/convert.ipynb)