<div align="center"><h1>Modeling the Messy Complexity of Texts</h1>
<h3>_Computational approaches to in-text variation_</h3>
<br/>
<h4>Elli Bleeker, Bram Buitendijk, Ronald Haentjens Dekker
    <br/>R&amp;D group -  Royal Dutch Academy of Arts and Sciences</h4>
    <h6>Symposium "Writing and Revision Stages"
    <br/>June 6, 2019 - University of Lisbon</h6>
</div>

**NOTE** 

For my presentation, I used the RISE extension for Jupyter Notebook, which lets you create a reveal.js-based presentation. You can download Jupyter Notebook [here](https://jupyter.org/install) and find an introduction to RISE [here](https://rise.readthedocs.io/en/stable/index.html). 


You can download my entire talk [here](https://github.com/bleekere/writing-revision). After installing RISE in the same folder in which you downloaded this presentation, you can view it as a slideshow.

The following is a downloaded version of my talk that reads as one long text. The slightly odd formatting results from converting the RISE slides into text.

<div align="center">_"The Holy Grail of computer science is to capture the messy complexity of the natural world and express it algorithmically"_
<br/><br/>(Teresa Marrin Nakra, 2006)
    </div>

<div align="center">How can we capture the messy complexity of literary texts? 
</div>

<div align="center">In what ways can we express our textual knowledge algorithmically?</div>

The framework of this talk differs from that of most other talks at this symposium: instead of presenting a particular use case, I proceed from a high-level view and describe how textual features that are particularly relevant for genetic studies fit better into a graph data model for text. I will use different textual fragments to illustrate my argument, and I invite you to consider how our approach could accommodate your cases.

Framework: data models and their influence on our text encoding

Goals:

1. Make clear why the existing approaches are limited; why we should not settle for workarounds
2. Emphasise the strengths and values of existing approaches (the flexibility XML offers its users; the vast number of tools based on TEI/XML; the large and active user community). We neither deny nor ignore these values
3. Evaluate the implications of using TAG to model non-linear text: does TAG perform better? Does the TAG model actually come close to how human readers understand in-text variation?
4. What the audience needs to take away:
    - A broader understanding of the affordances of data models and how they may serve us best
    - The additional value of using a data model that is closely related to your personal understanding of the object you're modeling
    - The additional value of looking at textual objects from an informational perspective, so as to find the best algorithmic translation
    - How thinking about the best translation also compels you to think about how _you_ understand that specific phenomenon; e.g. is it actually correct to say in-text variation _always_ constitutes non-linear text? Is non-linear text partially ordered information?
    - The strengths of TAG for modeling complex textual features
    - The challenges of launching a new data model without completely discarding the existing ones (I am quite aware we are not the first ones to have come up with "the best" way to, well, everything)

# Outline

1. Background: data models for text
2. Encoding in-text revision
    - examples
    - theoretical definition
    - informational definition
    - existing approaches to modeling non-linear text
3. A very brief introduction to the TAG model
    - data model
    - syntax
    - implications
4. Modeling in-text variation in TAG
5. Evaluation
6. Discussion


# Data models for text

The data model we use influences our text encoding and defines what textual features we can record. The ways in which we algorithmically express information about text also define how that information can be parsed, queried, and visualised. In short: understanding the strengths and limitations of a data model is vital in order to arrive at a correct text encoding.

The most familiar limitation of the OHCO model behind XML is that it does not support structures that do not fit within a single hierarchy, which is known as the "overlap problem". In the context of studying modern manuscripts, this inability to encode features that do not fit neatly into one hierarchy is XML's most important limitation. Familiar examples are the combination of documentary and textual features, both relevant to genetically oriented scholars. Over the years, alternative data models for text have been proposed (like LMNL or GODDAG), but these mainly focused on overcoming the overlap problem. We think that overlap is not _the_ problem, but a symptom of a higher-level problem. Less familiar examples of hierarchy-breaking features are, for instance, discontinuous text or non-linear text, both equally relevant for a genetic approach to text modeling. 

This is why I start with a high-level perspective on what information we want to express, in order to come to a better understanding of the best way to express it. Instead of proposing ways to circumvent the limitations of XML, this presentation describes a powerful new data model for expressing information about complex literary texts in an idiomatic way, without having to resort to workarounds or hacks.

Finally, I want to emphasize that, in view of the time, this presentation focuses on one small (yet significant) aspect of modern manuscripts: their non-linearity and how to express it in markup. I am fully aware of the existence of other textual features relevant to genetic studies (_inter alia_ the materiality of the document -- layout, script, binding, the ductus, the paper, etc. -- the separation between _Befund_ and _Deutung_, the process of writing, etc.). 

We have at our disposal a number of different data models to express and represent text. I'll give a brief outline of the most common or extraordinary ones. Keep in mind the difference between a data model and a syntax! A syntax is a serialisation of a data model, e.g., XML is a serialisation (= a linearly ordered expression) of the OHCO model, but SGML is as well.

In other words, this XML sentence  

```<root><s>The sun is not yellow</s></root>```

is a serialisation of this tree: 

<img src="images/simple-tree.png">
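The relation between the serialisation and the tree can be made tangible with a few lines of Python. This is a minimal sketch that uses the standard library's `xml.etree.ElementTree`; the helper name `describe` is my own and purely illustrative. It parses the XML sentence into a tree, prints one indented line per node, and then serialises the tree back into the same string.

```python
import xml.etree.ElementTree as ET

xml_sentence = "<root><s>The sun is not yellow</s></root>"

def describe(element, depth=0):
    """Return one indented line per node: the element name, then its text."""
    lines = ["  " * depth + element.tag]
    if element.text and element.text.strip():
        lines.append("  " * (depth + 1) + repr(element.text))
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

# Parsing turns the linear string into a tree (the data model).
root = ET.fromstring(xml_sentence)
tree_lines = describe(root)
print("\n".join(tree_lines))

# Serialising turns the tree back into a linear string (the syntax).
serialised = ET.tostring(root, encoding="unicode")
print(serialised)
```

The tree and the string carry the same information; only the string has to commit to one linear order of characters.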

| Data model | Serialisation |
|:------:|:--------------------:|
| String | plain text (ranges) |
| OHCO | SGML, XML |
| Key:value pairs | JSON |
| Graph | Turtle, N-Triple, RDFa, EARMARK... |

Each of these formats has its disadvantages but also its own merits. As Fabio Vitali argued a few years back, with the help of some coding and hacks you can express almost everything in every data model (though it ain't always pretty).

<img src="images/vitali-data-formats.png">

Image by Vitali (2016)

Making use of what is already there has a lot of benefits: it is more or less stable, people know and understand it, there's usually a community, tutorials and tools made to work with that format. The downsides:

1. Models influence the way we think and argue about text.

It's a sneaky one: if we use certain models long enough, they can influence the way we think and argue about text. They can even - very subtly - encourage us to ignore certain features that are not represented in that particular model. Patrick Sahle noted, with regard to TEI,

<div align="center">_"TEI concentrates above all on the text as work (structure), as linguistic utterance, and as canonised, defined, or variant version. The text as intentional message, as semantic content, but also the text as physical object, as document, is supported only marginally. The text as complex sign, as semiotic entity, plays no role in the TEI."_
    <br/>
<br/>
Patrick Sahle, 2013</div>

2. Workarounds and local "solutions" hinder interoperability and reusability

The more complex your texts (or what you want to do with them), the more coding, hacking, and workarounds you need. I am not saying that modeling complex textual features in XML or RDF is impossible. It may reduce the human-readability of the file (which is more important than you may think, especially when it concerns humanities scholars), but you may argue that the file is not intended to be read by humans, as long as machines understand it. However, it also hinders interoperability and delegates a lot of responsibility and complexity to the application level that processes the data.

So, in summary:

<div align="center">A data model influences the text encoding as well as what features we (can) record.</div>

<div align="center">The affordances and limitations of a textual model influence our study of text.</div>

<div align="center">Understanding the strengths and limitations of a data model is crucial.</div>

Now, it is well-known that the _de facto_ data model for text encoding, TEI-based XML, is significantly limited when it comes to expressing textual properties that do not fit naturally within a single-rooted hierarchy. In discussing the TEI's abilities for transcribing texts, Edward Vanhoutte remarked:

<div align="center">_"The real problems arise when dealing with modern manuscript material"_
    <br/>
    <br/>(Edward Vanhoutte, 2002)</div>

Seventeen years later, this remark still holds true. After all, that is why we are here today.

## Encoding in-text revisions

### What is in-text revision?
In essence, _in-text_ revisions constitute non-linear text.

### Examples 

(Cf. the overview of textual alterations in the TEI Working Group's _Encoding Model for Genetic Editions_ 2010, §3.2)

#### Deletion

#### Addition

#### Substitution

### Other instantiations of non-linear text

#### Instant corrections

#### Open variants

#### Transpositions

How can we translate these elements so that the computer understands them in the same way we do? As with any translation, even though one side is non-human, this comes down to finding the right context (the framework), the right words (vocabulary and syntax) and, above all, the right effect.

What effect do we want to realize by translating non-linear text to a computer-readable format?

I am reminded of something Dirk once said about translating textual genetic concepts:

<div align="center">"Instead of smoothing out the textual contingencies of the complex genesis, the translation calls attention to them, enhancing the readers’ textual awareness."
<br/>
<br/>
(Dirk van Hulle, 2015)</div>

### How does non-linear text translate informatically?

"All documents are structured, but some documents are more structured than others."

- unordered information
- ordered information
- partially ordered information 

### Non-linear text = partially ordered information
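To make "partially ordered" concrete, here is a minimal Python sketch. The sentence and the substitution in it are hypothetical (a deleted "not" replaced by an added "very"), and the names `tokens`, `follows`, `readings`, and `precedes` are illustrative only; this is not how TAG stores text. The point is that both branches of the substitution are ordered with respect to the surrounding text, but not with respect to each other.

```python
# Hypothetical example: "The sun is", then a substitution
# (deleted "not" / added "very"), then "yellow".
tokens = {
    "t1": "The sun is",
    "del": "not",
    "add": "very",
    "t2": "yellow",
}

# Successor relation between tokens. The text forks after "t1"
# and joins again at "t2".
follows = {
    "t1": ["del", "add"],
    "del": ["t2"],
    "add": ["t2"],
    "t2": [],
}

def readings(node):
    """Enumerate every linear reading (path) that starts at `node`."""
    if not follows[node]:
        return [[tokens[node]]]
    return [
        [tokens[node]] + rest
        for successor in follows[node]
        for rest in readings(successor)
    ]

def precedes(a, b):
    """True if `b` can be reached from `a`, i.e. `a` comes before `b`."""
    return any(nxt == b or precedes(nxt, b) for nxt in follows[a])

all_readings = [" ".join(path) for path in readings("t1")]
print(all_readings)

# "t1" precedes "t2", but the two branches are not ordered relative
# to each other: neither precedes the other.
print(precedes("t1", "t2"), precedes("del", "add"), precedes("add", "del"))
```

A fully ordered model would have to pick one of the two readings, or put "not" before "very" (or vice versa) even though that order carries no meaning.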

### Example

#### Substitution

### Example

#### Editorial corrections

- **String (e.g. plain text)**  
One order of tokens; no discontinuity; no overlap

- **Key:value pairs (e.g. JSON)**  
Unordered (order is not informational); hierarchical data is supported, but the format is not really usable for long texts

- **Tree (e.g. XML)**  
Order is informational unless indicated otherwise; discontinuity is not a problem (with linking); no overlap without workarounds like standoff. XML is a linearization of a tree structure.

- **Graph (e.g. RDF or GODDAG)**  
There are many different kinds of graphs. For most of them, overlap is not an issue (although it may take some workarounds, see EARMARK re: RDF); GODDAG allows for multiple parenthood and multiple orders of the text tokens (leaves)

For example, the use of TEI-XML elements to represent regularization (orig/reg), correction (sic/corr), or abbreviation (abbr/expan) is ordered in the sense that two XML documents that differ in the order of an orig/reg choice are different XML documents, and that difference can be ignored only at the application level.
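The following Python sketch illustrates this point with the standard library's `xml.etree.ElementTree`. The two `choice` fragments and the helper names are made up for illustration. Compared as XML, the two documents differ because child order is part of the model; only an application-level rule (here: compare the children as a set) can declare them equivalent.

```python
import xml.etree.ElementTree as ET

# Two hypothetical transcriptions that differ only in the order
# of the children of <choice>.
doc_a = ET.fromstring("<choice><sic>teh</sic><corr>the</corr></choice>")
doc_b = ET.fromstring("<choice><corr>the</corr><sic>teh</sic></choice>")

def ordered_children(choice):
    """The XML view: children form a sequence, so order matters."""
    return [(child.tag, child.text) for child in choice]

def unordered_children(choice):
    """An application-level rule: treat the children as a set."""
    return {(child.tag, child.text) for child in choice}

same_as_xml = ordered_children(doc_a) == ordered_children(doc_b)
same_for_application = unordered_children(doc_a) == unordered_children(doc_b)

print(same_as_xml)           # False: different XML documents
print(same_for_application)  # True: the order is ignored by the application
```

The knowledge that the order is meaningless lives in `unordered_children`, not in the data itself, so every application that processes the file has to reimplement it.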

## Our approach

### TAG (Text-As-Graph)

**Hypergraph model for text**  

<img align="center" width="600" height="600" src="images/hypergraph-general.png">

(image of handwritten phrase "`Ceci [add>n'<add]est [add>pas<add]une pipe.`")

<img align="center" width="600" height="600" src="images/hypergraph.png">

(several slides explaining the most relevant parts of the hypergraph model for text:)

- TAG data model: non-uniform cyclic property hypergraph of text

(image)

- Markup as sets on other nodes; image of hypergraph. This means that markup can point to multiple text nodes and vice versa

(image)
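The idea of markup as sets on text nodes can be sketched in a few lines of Python. This is an illustration of the principle only and not the TAG implementation: the tokenisation of the phrase, the `phrase` markup, and the helper names are my own assumptions. Each markup node is modelled as a hyperedge, that is, a name plus a set of text-node indices.

```python
# Assumed tokenisation of the handwritten phrase.
text_nodes = ["Ceci", "n'", "est", "pas", "une", "pipe."]

# Each markup node is a hyperedge: a name and a set of text-node indices.
markup = [
    ("phrase", {0, 1, 2, 3, 4, 5}),  # hypothetical markup for illustration
    ("add", {1}),                    # the added "n'"
    ("add", {3}),                    # the added "pas"
]

def text_of(markup_node):
    """All text nodes that a markup node points to, in document order."""
    _name, members = markup_node
    return [text_nodes[i] for i in sorted(members)]

def markup_on(index):
    """All markup names that point to the text node at `index`."""
    return [name for name, members in markup if index in members]

print(text_of(markup[0]))  # one markup node, many text nodes
print(markup_on(1))        # one text node, several markup nodes
```

Because membership is a set relation rather than containment, nothing forces the markup nodes to nest inside each other.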

#### Syntax: TAGML

A syntax is the serialisation of a data model.

Properties of TAGML:

- asymmetrical tags
- not just strings and markup, but other data types as well (numbers or boolean values)
- annotation can be nested

(examples)

### Transcribing non-linear text in TAGML

# Evaluation

- Effects of using TAG to model non-linear text
- Awareness of textual contingencies

What are the effects of using TAG to model non-linear text?

### Multiple perspectives on text

**Layers**

Layers are used to group together a set of hierarchically structured Markup nodes.



### Example
(image)

Reasons for grouping Markup nodes:

- technical
- practical
- collaborating

# Future work

## Additional information

- Writing tool, hand, ductus, topography... As annotations on the `[del>` and `[add>` tags


## Collation

- "How is collation to approach textual data that cannot be undoubtedly assigned to a specific writing stage?"
- "It is however an open question as to whether inter-document discrepancies at the dossier level should be regarded in the same way as intra-document alterations. If two witnesses are collated, we may observe that a word present in one is missing from the other: does it necessarily follow that this is an addition or a deletion, which we would not hesitate to mark with an add or del tag if we are transcribing a single manuscript?" (from the TEI _Encoding Model for Genetic Editions_)


# Just in case
## More information about TAG

### Implications of the TAG model

This design principle means that TAG processing can support any type of query, from Boolean to ranked pattern matches at the level of the model, and that the complex mixture of information can be parsed and processed in an idiomatic manner and without work-arounds.

Why is this important?

1. Idiomatic modelling
2. Separation of responsibilities

1. When we structure information in a way that is close to how we understand it to be, our means of storing and querying become more powerful. An example: the value of XML attributes is always a string (`<date year="1928"/>`) whereas in TAGML this can be an actual number (`[date year=1928><date]`)

2. TAG is designed according to the principle of separating responsibilities. In XML, most responsibilities are transferred to the application layer; in TAG we organise and structure information at the level of the model. Many textual aspects discussed in this paper can be modeled in, for instance, an XML-transcription with an associated schema and application-level rules. TAGML, however, moves much of that responsibility to the syntax by having explicit encoding mechanisms for containment, dominance, discontinuity, non-linearity, and overlap, with the goal of removing ambiguity from the application level. Accordingly, TAGML brings together and expands on qualities of existing formats, and creates an inclusive and definite framework for modeling textual and structural information. I will return to why this is important later.
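The first point, about attribute values, can be illustrated from the XML side with Python's standard `xml.etree.ElementTree`. This is a minimal sketch; the variable names are illustrative. The parser hands back the year as a string, so the conversion to a number, and any error handling that goes with it, falls to the application.

```python
import xml.etree.ElementTree as ET

date = ET.fromstring('<date year="1928"/>')

# The XML data model only knows string-valued attributes.
year_as_stored = date.get("year")
print(type(year_as_stored).__name__)  # str

# Comparing the stored values compares characters, not quantities:
# "1928" sorts before "200" because "1" comes before "2".
string_comparison = year_as_stored < "200"

# The application has to know that this attribute is meant as a number.
year_as_number = int(year_as_stored)
numeric_comparison = year_as_number < 200

print(string_comparison, numeric_comparison)  # True False
```

When the data type is part of the model, as with `[date year=1928><date]`, this knowledge no longer has to be repeated in every application that reads the file.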

# OLD

In this talk, I focus on a small but significant aspect of the study of literary manuscripts: **capturing their multidimensional nature**. When we talk about "multidimensional nature" in textual studies we usually mean the combined set of characteristics: textual aspects, material aspects and chronological aspects.

In standard text encoding models that are based on the single hierarchy of XML, one of these dimensions has to be given priority. Aspects from other dimensions can be recorded but do not have a natural place in the data structure.