_"The Holy Grail of computer science is to capture the messy complexity of the natural world and express it algorithmically"_
(Teresa Marrin Nakra, 2006)

How can we capture the messy complexity of literary texts? In what ways can we express our textual knowledge algorithmically?

Framework: data models and their influence on our text encoding

Goals:

1. Make clear why the existing approaches are limited; why we should not settle for workarounds
2. Emphasise the strengths and values of existing approaches (the flexibility XML offers its users; the large number of tools based on TEI/XML; the large and active user community). We neither deny nor ignore these values
3. Evaluate the implications of using TAG to model non-linear text: does TAG perform better? Does the TAG model actually come close to how human readers understand in-text variation?
4. What the audience needs to take away:
    - A broader understanding of the affordances of data models and how they may serve us best
    - The additional value of looking at textual objects from an informational perspective, so as to find the best algorithmic translation
    - How thinking about the best translation also compels you to think about how _you_ understand that specific phenomenon; e.g. is it actually correct to say in-text variation constitutes non-linear text; is non-ln
    - The strengths of TAG for modeling complex textual features
    - The challenges of launching a new data model without completely discarding the existing ones (I am quite aware we are not the first ones to have come up with "the best" way to, well, everything)

# Data models for text encoding

The data model we use influences our text encoding and defines what textual features we can record. The ways in which we algorithmically express information about text also define how that information can be parsed, queried, and visualised. In short: understanding the strengths and limitations of a data model is vital in order to arrive at a correct text encoding.

The strengths of XML, the preferred encoding format for textual scholars, are well known. In the context of studying modern manuscripts, the most important limitation of XML is the inability to encode non-hierarchical features, or more precisely: features that do not fit neatly into one hierarchy. Familiar examples are the combination of documentary and textual features that is of interest to genetically oriented scholars; less familiar examples of hierarchy-breaking features are discontinuous text or nonlinear text.

Instead of proposing ways to circumvent these limitations of XML, this presentation describes a powerful new data model for expressing information about complex literary texts in an idiomatic way, without having to resort to workarounds or hacks.

# Outline

1. Encoding in-text revision
    - examples
    - theoretical definition
    - informational definition
    - existing approaches to modeling non-linear text
2. A very brief introduction to the TAG model
    - data model
    - syntax
    - implications
3. Modeling in-text variation in TAG
4. Evaluation
5. Discussion


The framework of this talk differs from that of most other talks at this symposium: instead of describing a particular use case, I present a new data model for text that is designed to handle a wide variety of textual features. For that reason, I will use different textual fragments as examples, and I invite you to consider how our approach could accommodate your own cases.

## Encoding _in-text_ revision

### What is in-text revision?
In essence, _in-text_ revisions constitute non-linear text.

### Examples 

(Cf. the overview of textual alterations in the TEI Working Group's _Encoding Model for Genetic Editions_ 2010, §3.2)

#### Deletion

#### Addition

#### Substitution

### Other instantiations of non-linear text

#### Instant corrections

#### Open variants

#### Transpositions

### How does non-linear text translate informatically?

"All documents are structured, but some documents are more structured than others."

- unordered information
- ordered information
- partially ordered information 

### Non-linear text = partially ordered information
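
To make "partially ordered" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the TAG implementation: the sentence is invented, and the names `text` and `readings` are hypothetical. The positions of the text are ordered, but the branches of a substitution are not ordered with respect to each other.

```python
from itertools import product

# An invented sentence with one substitution: "quick" was replaced by "swift brown".
# Each position is a set of branches; a position with a single branch is plain linear text.
text = [
    frozenset({("The",)}),
    frozenset({("quick",), ("swift", "brown")}),  # unordered alternatives
    frozenset({("fox",)}),
]

def readings(positions):
    """Return every linear path through the partially ordered text."""
    paths = set()
    for choice in product(*positions):
        # Flatten the chosen branches, in position order, into one reading.
        paths.add(" ".join(token for branch in choice for token in branch))
    return paths

all_readings = readings(text)
print(sorted(all_readings))
```

The sequence of positions is informational; the order in which the two branches are listed is not, which is why they are stored in a set.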

### Example

For example, the use of TEI-XML elements to represent regularization (orig/reg), correction (sic/corr), or abbreviation (abbr/expan) is ordered in the sense that two XML documents that differ in the order of an orig/reg choice are different XML documents, and that difference can be ignored only at the application level.
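
The point can be sketched with Python's standard `xml.etree.ElementTree`. The two `choice` fragments and the helper names `ordered_view` and `unordered_view` are invented for illustration; the sketch only shows where the decision to ignore order has to live.

```python
import xml.etree.ElementTree as ET

# Two invented encodings of the same regularization; only the child order differs.
doc_a = ET.fromstring("<choice><orig>olde</orig><reg>old</reg></choice>")
doc_b = ET.fromstring("<choice><reg>old</reg><orig>olde</orig></choice>")

def ordered_view(choice):
    """Children as the XML tree holds them: an ordered sequence."""
    return [(child.tag, child.text) for child in choice]

def unordered_view(choice):
    """Children as the encoder means them: an unordered set of alternatives."""
    return frozenset((child.tag, child.text) for child in choice)

# At the level of the data model the two documents differ ...
documents_differ = ordered_view(doc_a) != ordered_view(doc_b)
# ... and only application-level code can decide that the order is irrelevant.
same_alternatives = unordered_view(doc_a) == unordered_view(doc_b)

print(documents_differ, same_alternatives)
```

Nothing in the XML itself says that `unordered_view` is the right comparison; that knowledge sits in the application.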

### Non-linear text in existing data models

- **String (e.g. plain text)**  
One order of tokens; no discontinuity; no overlap

- **Key:value pairs (e.g. JSON)**  
Unordered (order is not informational); hierarchical data is supported, but the format is not really usable for long texts

- **Tree (e.g. XML)**  
Order is informational unless indicated otherwise; discontinuity is not a problem (with linking); no overlap without workarounds like standoff. XML is a linearization of a tree structure.

- **Graph (e.g. RDF or GODDAG)**  
There are many different kinds of graph. For most graphs, overlap is not an issue (although it may take some workarounds, see EARMARK re: RDF); GODDAG allows for multiple parenthood and multiple orders of the text tokens (leaves)
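
Whether order is informational can be seen directly in code. The following Python sketch uses invented sample data and hypothetical variable names; it contrasts key:value pairs, where key order is ignored, with a token sequence, where order is part of the information.

```python
import json

# Key:value pairs: the order in which the keys are written carries no information.
pairs_a = json.loads('{"deleted": "quick", "added": "swift"}')
pairs_b = json.loads('{"added": "swift", "deleted": "quick"}')
key_order_ignored = pairs_a == pairs_b

# A sequence of tokens: there is exactly one order, and it is informational.
tokens_a = ["the", "quick", "fox"]
tokens_b = ["the", "fox", "quick"]
token_order_matters = tokens_a != tokens_b

print(key_order_ignored, token_order_matters)
```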

In the words of Patrick Sahle (2013): the affordances and limitations of a prevailing technology may blind us to aspects not supported by that technology.


## Our approach

### TAG

**Hypergraph model for text**  

(several slides explaining the most relevant parts of the hypergraph model for text:)

- TAG data model: non-uniform cyclic property hypergraph of text
- Markup as sets on other nodes; image of hypergraph. This means that markup can point to multiple text nodes and vice versa
- TAG is designed to be able to model (and TAGML is designed to be able to encode) text and markup, including overlapping markup and ordered, partially ordered, and unordered information.
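
The idea of markup as sets on text nodes can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the TAG implementation: the class `Markup`, the helper functions, and the sample text are all assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Markup:
    """A hyperedge: one markup name attached to a set of text-node indices."""
    name: str
    nodes: frozenset

# Invented text, split into four text nodes (indices 0 to 3).
text_nodes = ["She said ", "nothing, ", "then left ", "the room."]

line = Markup("line", frozenset({0, 1}))
phrase = Markup("phrase", frozenset({1, 2, 3}))  # overlaps with "line"
quote = Markup("q", frozenset({0, 2}))           # discontinuous

def overlaps(a, b):
    """True if the two markups share text nodes and neither contains the other."""
    shared = a.nodes & b.nodes
    return bool(shared) and not (a.nodes <= b.nodes or b.nodes <= a.nodes)

def is_discontinuous(markup):
    """True if the markup skips at least one text node between its first and last."""
    return max(markup.nodes) - min(markup.nodes) + 1 != len(markup.nodes)

def markup_on(index, markups):
    """Names of all markup that point to the text node at this index."""
    return {m.name for m in markups if index in m.nodes}

print(overlaps(line, phrase), is_discontinuous(quote), markup_on(1, [line, phrase, quote]))
```

Because each markup is a set of nodes rather than a pair of positions in a tree, overlap and discontinuity need no special treatment in this sketch.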


### Implications of the TAG model

This design principle means that TAG processing can support any type of query, from Boolean to ranked pattern matches, at the level of the model, and that the complex mixture of information can be parsed and processed in an idiomatic manner, without workarounds.

Why is this important?

1. Idiomatic modelling
2. Separation of responsibilities

1. When we structure information in a way that is close to how we understand it to be, our means of storing and querying become more powerful. An example: the value of XML attributes is always a string (`<date year="1928"/>`) whereas in TAGML this can be an actual number (`[date year=1928><date]`)

2. TAG is designed according to the principle of separating responsibilities. In XML, most responsibilities are transferred to the application layer; in TAG we organise and structure information at the level of the model. Many textual aspects discussed in this paper can be modeled in, for instance, an XML transcription with an associated schema and application-level rules. TAGML, however, moves much of that responsibility to the syntax by having explicit encoding mechanisms for containment, dominance, discontinuity, non-linearity, and overlap, with the goal of removing ambiguity from the application level. Accordingly, TAGML brings together and expands on qualities of existing formats, and creates an inclusive and definite framework for modeling textual and structural information. I will return to why this is important later.
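
The attribute example can be checked with Python's standard XML parser. The sketch below covers only the XML side; it does not parse TAGML, and the variable names are hypothetical. It shows that the year arrives as a string and that the conversion, and any query that depends on it, falls to the application.

```python
import xml.etree.ElementTree as ET

date = ET.fromstring('<date year="1928"/>')

# The XML data model hands the attribute value over as a string.
raw_year = date.attrib["year"]
is_string = isinstance(raw_year, str)

# Comparing years as strings gives a wrong answer for a query such as "before the year 203".
string_comparison = raw_year < "203"   # compares character by character
# The application has to know that this attribute is a number and convert it.
number_comparison = int(raw_year) < 203

print(is_string, string_comparison, number_comparison)
```

With a typed value in the model, the numeric comparison would be the only one available.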

#### Syntax: TAGML

A syntax is the serialisation of a data model.

Properties of TAGML:

- asymmetrical tags
- not just strings and markup, but other data types as well (numbers or Boolean values)
- annotations can be nested

# Evaluation

What are the effects of using TAG to model non-linear text?

### Preventing information overload

**Layers**

Layers are used to group together Markup nodes.

# Future work

## Collation

- "How is collation to approach textual data that cannot be undoubtedly assigned to a specific writing stage?"
- "It is however an open question as to whether inter-document discrepancies at the dossier level should be regarded in the same way as intra-document alterations. If two witnesses are collated, we may observe that a word present in one is missing from the other: does it necessarily follow that this is an addition or a deletion, which we would not hesitate to mark with an add or del tag if we are transcribing a single manuscript?" (from the TEI _Encoding Model for Genetic Editions_)


# OLD

In this talk, I focus on a small but significant aspect of the study of literary manuscripts: **capturing their multidimensional nature**. When we talk about "multidimensional nature" in textual studies we usually mean the combined set of characteristics: textual aspects, material aspects and chronological aspects.

In standard text encoding models that are based on the single hierarchy of XML, one of these dimensions has to be given priority. Aspects from other dimensions can be recorded but do not have a natural place in the data structure.

# Context

_"The real problems arise when dealing with modern manuscript material"_ (Vanhoutte 2002; still holds true seventeen years later)

Alternative encoding models for the encoding of draft manuscripts: Edward Vanhoutte endeavoured to "put time back into manuscripts" (2002); Elena Pierazzo also discussed this chronological aspect as she studied how to represent the discontinuous nature of the writing process on working manuscripts (_Text Editing, Print, And the Digital World_; 2016).

# Problem description

I don't have to remind you about the aims of textual genetic studies; these have been thoroughly defined by scholars like Almuth Grésillon (1994) or Daniel Ferrer (1995). We have gathered here today to witness the results of the matrimony of textual genetic studies and computer technologies. 

## Nonlinear text

### Representing non-linearity in a data model

### TEI/XML

### Alternative encoding methods

- LMNL (ranges on plain text)
- GODDAG (data model for overlapping hierarchies; serialization TexMECS or XCONCUR)
- Standoff markup / standoff properties
- RDF for texts (serialization: EARMARK)

### TAGML

Describe why these alternative methods do not meet the requirements of our case.

# Approach