<img align="right" src="images/tf.png" width="128"/>

# Appendix TF: Text-Fabric

-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*

-- this appendix by Dirk Roorda

---

In this appendix we point you to an approach of radical pre-processing of text corpora.
Pre-processing is tedious, time-consuming, and error-prone.
It takes the focus away from your original research question.
Yet it must be done, otherwise your tools will not work or will give suboptimal results.

When several researchers fire multiple research questions at the same corpus, they are likely to
perform pre-processing steps on their own, geared towards their particular questions.
In doing so they duplicate a fair bit of work.

Yet it is not easy to prevent that, because input data tends to be messy and research questions almost by
definition require new methods and tools. So where is the common ground?

## What is Text-Fabric?

[Text-Fabric](https://github.com/annotation/text-fabric) is a tool that

* defines the
  [common ground](https://annotation.github.io/text-fabric/Model/Data-Model/)
  by viewing text as a graph whose nodes and edges are annotated by features
  (even the text itself is a feature, or several features if there are multiple ways of representing the text)
* defines a
  [file format](https://annotation.github.io/text-fabric/Model/File-formats/)
  `.tf` to store individual feature data [efficiently](https://annotation.github.io/text-fabric/Model/Optimizations/)
* provides an
  [API](https://annotation.github.io/text-fabric/Api/Fabric/)
  to access all data in a text in a convenient way, including search by graph templates.
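
To make the "text as an annotated graph" idea concrete, here is a minimal sketch in plain Python. It does not use the Text-Fabric API and is not how TF stores data internally; the node numbers and the `contains` helper are assumptions chosen for illustration only. It shows nodes as integers, features as separate mappings from nodes to values, and one edge feature linking a sentence node to its words.

```python
# Toy illustration of a text modelled as an annotated graph.
# Plain dicts only: this is NOT the Text-Fabric API or its storage format.

# Nodes are integers. Nodes 1-3 are words, node 4 is a sentence.
otype = {1: "word", 2: "word", 3: "word", 4: "sentence"}

# The text itself is just another node feature.
text = {1: "In", 2: "the", 3: "beginning"}

# An independent annotation layer on the same nodes.
pos = {1: "prep", 2: "art", 3: "noun"}

# An edge feature (hypothetical name): which word nodes a sentence contains.
contains = {4: [1, 2, 3]}


def words_of(node):
    """Return the text of all word nodes contained in the given node."""
    return " ".join(text[w] for w in contains.get(node, []))


sentence_text = words_of(4)
nouns = [n for n, tag in pos.items() if tag == "noun"]

print(sentence_text)
print(nouns)
```

Because each feature is its own mapping, a new annotation layer can be added without touching the existing ones.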

## TF apps

While this is common ground enough for many applications, there is also the concept of *apps*.

A [TF-app](https://annotation.github.io/text-fabric/Api/App/)
is a bundle of settings and a few implemented functions in which the particulars of a corpus
can be defined, such as the script, the details of text representation, and the online locations of the data.

A corpus that is supported by an app can be viewed in the Text-Fabric browser, much as you view a Jupyter notebook
in your browser. If you have done `pip install text-fabric`, you can start up a browser interface on your corpus with, for example:

```sh
text-fabric dss
```

In this case you get the `dss` corpus.

And inside programs, you can start working with a corpus just like this:

```python
from tf.app import use

A = use('dss')
```

after which `A` is your handle to all data inside the corpus.

The list of
[corpora](https://annotation.github.io/text-fabric/About/Corpora/)
supported by an app is growing, and you are encouraged to provide an app for your own corpus.

## Conversion

There is a toy 99-word corpus,
[banks](https://nbviewer.jupyter.org/github/annotation/banks/blob/master/programs/convert.ipynb),
where you can see in detail how to use TF itself to convert your corpus to TF.

The author of TF has used the same method to get the Dead Sea Scrolls, the Quran, and the Old Babylonian Letter corpus into TF.

And Ernst Boogert has used it to get the works of
[40 church fathers](https://github.com/pthu/patristics)
from Perseus into TF.

## Docs and tutorials

If you have followed the links above, you have already landed on the 
[TF docs](https://annotation.github.io/text-fabric/).

There is also a collection of
[tutorial notebooks](https://nbviewer.jupyter.org/github/annotation/tutorials/tree/master/)
per corpus where you are guided into more and more involved expeditions into these corpora.

## Sharing data and Open Science

A good illustration of how a part-of-speech tagging workflow on an ancient text could proceed is
[posTag](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/oldbabylonian/cookbook/posTag.ipynb)
where we use rules to identify the part of speech of Akkadian words.

It is a process of stepwise refinement, and much testing, which is why we publish the intermediate results.
These results themselves are features for the same underlying graph, and since features are stored in individual files,
it is super easy to share them through GitHub.
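
The following sketch illustrates the general shape of such a workflow. It is not the code of the posTag notebook: the tokens are English placeholders, the rules are invented, and the JSON files are a stand-in for real `.tf` feature files. It shows rules applied in rounds, each round yielding a `pos` feature on the same nodes, and every feature saved to a file of its own.

```python
import json
import tempfile
from pathlib import Path

# Placeholder word nodes (not Akkadian): node number -> word form.
words = {1: "the", 2: "king", 3: "sent", 4: "the", 5: "letter"}


# Invented rules: each returns a tag, or None if it does not apply.
def rule_article(form):
    return "art" if form == "the" else None


def rule_known_verb(form):
    return "verb" if form in {"sent"} else None


def tag(words, rules):
    """Apply the rules in order; the first rule that fires decides the tag."""
    pos = {}
    for node, form in words.items():
        for rule in rules:
            label = rule(form)
            if label is not None:
                pos[node] = label
                break
    return pos


def save_feature(directory, name, mapping):
    """Write one feature to its own file (JSON here, not the .tf format)."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return path


# Stepwise refinement: the second round adds a rule to the first.
pos_v1 = tag(words, [rule_article])
pos_v2 = tag(words, [rule_article, rule_known_verb])
untagged = [n for n in words if n not in pos_v2]

with tempfile.TemporaryDirectory() as tmp:
    saved = [
        save_feature(tmp, "text", words).name,
        save_feature(tmp, "pos_v1", pos_v1).name,
        save_feature(tmp, "pos_v2", pos_v2).name,
    ]

print(pos_v1, pos_v2, untagged, saved)
```

Keeping the intermediate `pos_v1` next to `pos_v2` lets a domain expert compare rounds and see which nodes still lack a tag.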

The domain experts can start their TF-browser as follows:

```sh
text-fabric oldbabylonian --mod=annotation/tutorials/oldbabylonian/cookbook/pos/tf
```

and can immediately query the data using the features that exist in that location on GitHub.

![babex](images/tfbabexample.png)

## Development

Text-Fabric is being developed actively, in response to researchers' needs.

Have a look at the
[project and issues](https://github.com/orgs/annotation/projects/1)
if you want to take part.