# Addressing Ancient Promises
## _Text Modeling and Alexandria_

Elli Bleeker  
_Research and Development - Humanities Cluster, Royal Academy of Arts and Sciences_  
2 November 2018

In _The Call of Cthulhu_ (1926), science fiction writer H.P. Lovecraft wrote:

_"The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and safety of a new dark age."_

Despite the ominous picture painted here, humankind has had the desire to "piece together dissociated knowledge" for centuries. Around 290 BC, King Ptolemy I or his son, King Ptolemy II (the record's not clear on that), ordered the founding of a research institute which they called the "Mouseion" in honour of the Muses. Part of this research institute was the library of Alexandria, famous to this day, which aimed to be a "universal library": they wanted to "collect all the books in the world" and, notably, "the writings of all men _as far as they were worth any attention_". Like all archival collections, the library of Alexandria was built on a number of selection criteria. (We'll get back to that later.)

<img src="images/Ancientlibraryalex.jpg">

The librarians were collectors of information and created what you can see as the first centre of knowledge production. Their methods of examining copies and establishing "the best" form the basis of textual criticism. Scholars from all over the civilised world came to work in the Mouseion. They had considerable academic freedom and made avid use of the library's collection, which resulted in high-end scholarship.

The goal of the Mouseion and Alexandria was

"to unite all scholarly knowledge"

There was undeniably a political reasoning behind this objective: the ancient Egyptian rulers certainly understood the truth behind the idiom "knowledge is power" and were pretty rigorous in their methods of acquisition and their dissemination policy. But, politics aside, the combination of a vast library collection and an international research center facilitated the production of more knowledge, and led to advances in the fields of natural sciences, geography, mathematics, literature, and philosophy. I think it safe to assume that today's digital libraries and archives still have this noble ambition.

But what is that, exactly: _unite all scholarly knowledge_? Arguably, the least problematic word in that phrase is "scholarly", although what is deemed scholarly and what is not is also a matter of debate. The other words are even trickier. What does it mean, to "unite"? Does it suffice to bring together all books in one physical place? Books may stand side-by-side on a shelf, but that doesn't mean that their contents are united, let alone synthesised. What is knowledge, actually? And how do you know if you have it all (do you ever)?

With regard to that last question, it's interesting to look at the statement embodied by the Open World Assumption of the Semantic Web:

"No single agent or observer has complete knowledge... We admit that our knowledge of the world is incomplete."

It means that the absence of a piece of information doesn't make it false: something can be true even if we cannot prove it's true. We'll hear more about this assumption later.
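To make the difference tangible, here is a minimal Python sketch. It is an illustration only, not taken from any Semantic Web library: the triples, the `Answer` type and both `ask_*` functions are made up for this example. The same query gets a different answer depending on whether missing information is read as "false" or as "unknown".

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# A tiny, hypothetical knowledge base of (subject, predicate, object) triples.
facts = {("Mouseion", "locatedIn", "Alexandria")}
negated_facts = set()  # statements explicitly recorded as false


def ask_closed_world(facts, statement):
    """Closed world: whatever is not recorded is taken to be false."""
    return Answer.TRUE if statement in facts else Answer.FALSE


def ask_open_world(facts, negated_facts, statement):
    """Open world: a statement is false only if its negation is recorded."""
    if statement in facts:
        return Answer.TRUE
    if statement in negated_facts:
        return Answer.FALSE
    return Answer.UNKNOWN


query = ("Callimachus", "workedAt", "Mouseion")
closed = ask_closed_world(facts, query)               # Answer.FALSE
opened = ask_open_world(facts, negated_facts, query)  # Answer.UNKNOWN
```

The knowledge base is silent about Callimachus, so the closed-world reading denies the statement while the open-world reading simply admits that it does not know.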

Phrases like "unite scholarly knowledge" remind us of the visionaries from the early days of digital textual scholarship, who saw in the digital medium the way to "make information accessible for everyone to look at from different perspectives, across disciplinary divides", creating ever-expanding knowledge sites that transcend individual project "silos" and follow the linked data principles. And those of you working in digital textual scholarship will know that, despite our best intentions, we have not yet reached that goal.

In a recent review-turned-reflection on Mirador, my colleague Joris van Zundert notes that, in fact, these data silos still exist. He identifies at least two causes: limited financial resources (inter-institutional collaboration is more costly, both in time and in money, than a local solution) and the convenience of the developers, who incidentally _also_ have to deal with limited time and money. In short, even though _in principle_ most scholars would agree that

"... keeping tools and data locked in one place behind a one-size-fits-all interface would be in stark contrast to the heterogeneous demands and requirements scholars had for document resources that were spread, quite literally, across the planet..."

It remains our reality.

In this talk, I examine how digital text modeling can, potentially, fulfil the objective of the great research libraries of the past: unite and disseminate scholarly knowledge. I identify three requirements (or let's call them "desiderata", as a salute to Peter Robinson): 

# Desiderata 

1. an **inclusive data model** that 
    
    1.1. allows for advanced textual features 
    
    1.2. provides a way to formalise the meaning of markup; 


2. **support for multiple perspectives on a source text**;  


3. an **editing and version management tool** that is **platform-independent**.

The discussion of these desiderata forms primarily the background for a presentation of Alexandria, a text repository system that is based on a new data model called TAG, both under development at the R&D group of the HuC.

After the usual definitions I'll go in broad sweeps over concepts like information modeling, with which I'm sure all of you are familiar, but it gives me the chance to abuse computers a bit while citing computer scientists.

Because these requirements are related to the principles of the Semantic Web, the Linked Open Data (LOD) framework, and the ideals of a distributed architecture, I will touch upon those topics as well, but only in passing: the main goal of my talk is to establish an informed discussion about information modeling, data structures and digital textual scholarship. And, of course, to present and promote our own _Alexandria_.

## Definitions

First, some definitions (which I'll probably use interchangeably anyway).

**Information**

Information-as-process  

Information-as-knowledge  

Information-as-thing

1. **information-as-process**: the process of acquiring information, which may lead to knowledge.
2. **information-as-knowledge**: intangible. It needs to be expressed in a physical way, as a signal or communication. The expression makes the third form:
3. **information-as-thing**: tangible. Books, databases, bits, bytes, files, etc. Data. Text. Can be touched or measured. Representations of knowledge (because knowledge itself cannot be made tangible). Defined by Buckland as "evidence": it is related to understanding but in a passive way, as inferences can be drawn from it. This is the main focus of our work. 

**Data**

"There is a tendency to use 'data' to denote numerical information and to use 'text' to denote natural language in any medium."

"Literary texts cannot be reduced to data because they are too complex, very messy and never complete" 
(Marche 2012)

Still, I'd argue that this comment rather protests what "data" has come to mean in DH: the reduction of literary texts to quantifiable objects. In a similar vein, William Kent said that

"Information in its 'real' essence is probably too amorphous, too ambiguous, too subjective, too slippery and elusive, to ever be pinned down precisely by the objective and deterministic processes embodied in a computer."

Personally, I see no objection to the term: before they can be processed and analysed, literary texts are transformed into data with a certain structure. There's no question that this transformation entails a reduction and a loss of information, but that's rather the nature of modelling.

**Modeling**

By definition, 

a model is always a selection of the original

"We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this 'poverty' that makes it possible to handle them, and therefore to know"

Still, it has been considered

"the holy grail of computer science to algorithmically express the natural world"

Why is this so hard?

Because the natural world is ambiguous and complex; its products open to many different interpretations.

The computer, on the other hand, is rather dumb. Simplistic, if you will. William Kent described computer programming as

"... the art of getting an imbecile to play bridge or to fill out his tax returns by himself. It can be done, provided you know how to exploit the imbecile's limited talents, and are willing to have enormous patience with his inability to make the most trivial common sense decisions on his own."

"The first step toward understanding computers is an appreciation of their simplicity, not their complexity."

Still, despite this utter stupidity we're apparently dealing with on a day to day basis, we can already do lots of cool stuff with a computer when it comes to modeling complex documents. Clearly these achievements can be wholly and uniquely attributed to us, intelligent human beings!

This brings me to the first point, the first "desideratum" if you will:

## 1. A flexible and inclusive data model

We have at our disposal a number of different data models to express and represent text. I'll give a brief outline of the most common or extraordinary ones. Keep in mind the difference between a data model and a syntax! A syntax is a serialisation of a data model, e.g., XML is a serialisation (= a linearly ordered expression) of the OHCO model, and so is SGML.

| Data model | Serialisation |
|:----------:|:-------------:|
| String | plain text (ranges) |
| Tree | JSON |
| OHCO | SGML, XML |
| Graph | Turtle, N-Triples, RDF/XML... |
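The distinction between a data model and its serialisation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the nested-tuple `tree` and the helper names `to_json` and `to_xml` are invented here, and only the standard `json` and `xml.etree.ElementTree` modules are used. One and the same tree is written out in two different syntaxes.

```python
import json
import xml.etree.ElementTree as ET

# One data model: a small tree of (name, text-or-children) nodes.
tree = ("p", [
    ("line", "Thrice three hundred thousand years"),
    ("line", "We had been stained with bitter blood"),
])


def to_json(node):
    """Serialise the tree as nested JSON-compatible dictionaries."""
    name, content = node
    if isinstance(content, str):
        return {"name": name, "text": content}
    return {"name": name, "children": [to_json(child) for child in content]}


def to_xml(node):
    """Serialise the same tree as an XML element."""
    name, content = node
    element = ET.Element(name)
    if isinstance(content, str):
        element.text = content
    else:
        element.extend([to_xml(child) for child in content])
    return element


json_serialisation = json.dumps(to_json(tree))
xml_serialisation = ET.tostring(to_xml(tree), encoding="unicode")
```

Both strings can be parsed back into the same tree; what differs is the syntax, not the model.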


I'll talk about one more data model, the hypergraph, in a bit. Before we go on, I want to emphasise that each of these formats has its disadvantages but also its own merits. As Fabio Vitali argued a few years back, with the help of some coding and hacks you can express almost everything in every data model (though it ain't always pretty).

<img src="images/vitali-data-formats.png">

Making use of what is already there has a lot of benefits: it is more or less stable, people know and understand it, there's usually a community, tutorials and tools made to work with that format. The downside, however, is a sneaky one: if we use certain models long enough, they can influence the way we think and argue about text. They can even - very subtly - encourage us to ignore certain features that are not represented in that particular model. Patrick Sahle noted, with regard to TEI,

"The TEI concentrates above all on the text as work (structure), as linguistic utterance, and as a canonised, defined or variant version. The text as intentional message, as semantic content, but also the text as physical object, as document, is supported only marginally. The text as complex sign, as semiotic entity, plays no role in the TEI" (translated from the German)

In short: I am not saying that expressing complex textual features in RDF is impossible. It may lead to a reduced human-readability of the file (which is more important than you may think, especially when it concerns humanities scholars), but you may argue that the file is not intended to be read by humans, as long as machines understand it. Another downside is that it also reduces the reusability of the data.

Let's take a closer look at what I mean, exactly, with "complex textual features" that are hard to express. Most of these examples come from modern literary manuscripts, although you can find them anywhere.

# Difficult textual features 

- Overlapping structures  
- Discontinuous elements  
- Non-linear elements

# Overlapping structures
<img src="images/Selection-21v.png">
<img src="images/Selection-22v.png">

# Discontinuous elements
<img width="300" height="300" src="images/order.jpg">

# Non-linear structures
<img width="500" height="500" src="images/code-nonlinear.png">

<img align="center" width="500" height="300" src="images/order1a.png">

<img align="center" width="500" height="300" src="images/order1b.png">

If you are interested in such phenomena, and if you're modelling text I can hardly imagine you're _not_ interested in them, then there's value in using a data model that is close to your understanding of text. A model that can deal with these features natively. 
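Why overlap in particular is awkward for a single hierarchy can be shown with a short Python sketch. It assumes a deliberately simplified view of a transcription as a stream of open and close events; `is_well_nested` and the two event lists are hypothetical and exist only for this illustration.

```python
def is_well_nested(events):
    """Check whether ("open" | "close", name) events form one proper hierarchy."""
    stack = []
    for kind, name in events:
        if kind == "open":
            stack.append(name)
        elif not stack or stack.pop() != name:
            # A close that does not match the innermost open element.
            return False
    return not stack


# A paragraph that stays within its page nests without trouble.
nested = [("open", "page"), ("open", "p"), ("close", "p"), ("close", "page")]

# A paragraph that runs across a page break overlaps with the pages.
overlapping = [
    ("open", "page"), ("open", "p"), ("close", "page"),
    ("open", "page"), ("close", "p"), ("close", "page"),
]

nested_ok = is_well_nested(nested)            # True
overlapping_ok = is_well_nested(overlapping)  # False
```

In a strictly hierarchical model the second stream has to be rewritten with workarounds such as milestones or fragmented elements; a model that handles overlap natively can keep it as it is.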

With this in mind, we developed the hypergraph model for text. Like the graphs we're used to, a hypergraph consists of nodes and edges, but it also has hyperedges that can connect more than two nodes. To have the model support our understanding of text, we needed to define it as precisely as possible. In our definition,

Text is a multilayered, nonlinear object containing information that is at times ordered, partially ordered and unordered.
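As a rough illustration of the idea of a hyperedge, consider the Python sketch below. It is not the TAG implementation: the class `TextHypergraph`, its method names and the example sentence are all assumptions made for this example, and markup is simplified to a named hyperedge over text nodes. The point is only that one hyperedge can connect any number of text nodes, whether or not they are adjacent.

```python
from dataclasses import dataclass, field


@dataclass
class TextHypergraph:
    text_nodes: dict = field(default_factory=dict)  # node id -> text
    hyperedges: dict = field(default_factory=dict)  # markup name -> ordered node ids

    def add_text(self, node_id, text):
        self.text_nodes[node_id] = text

    def add_markup(self, name, node_ids):
        # A hyperedge may connect any number of text nodes, adjacent or not.
        self.hyperedges[name] = tuple(node_ids)

    def text_of(self, name):
        return "".join(self.text_nodes[n] for n in self.hyperedges[name])


graph = TextHypergraph()
graph.add_text(1, "Unite all knowledge, ")
graph.add_text(2, "said the king, ")
graph.add_text(3, "in one library.")

graph.add_markup("sentence", [1, 2, 3])
graph.add_markup("quote", [1, 3])  # discontinuous: skips the interruption

quote_text = graph.text_of("quote")
```

The discontinuous quotation is a single markup element here, so there is no need to split it into two fragments and link them back together afterwards.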

[image of discontinuity in hypergraph]

[image of non-linearity in hypergraph]

Describing how the TAG hypergraph model deals with complex texts can take up the rest of the evening. In addition to the data model, we've designed a serialisation (a markup language) called TAGML, which requires a grammar and a schema and that poses challenges of its own. If you want, we can get back to it later. For now, let's move on to the second desideratum:

# Support for multiple perspectives

What do I mean by that? Well, remember the Open World Assumption that we can never be entirely sure we know everything? You can also put it this way:

Scholars disagree on everything.

This is very much okay, because opposing views are the driving force behind research. But views do not always oppose one another, they may also happily coexist. In digital text modeling, these coexisting views are often a cause for overlap.

[example of the Faust edition]

In the TAG model, overlap is not an issue anymore. But how can we deal with these coexisting yet different views on the same data? Our solution is

## Layers

Layers are, in fact, TAG's solution to many issues.

Layers classify a set of markup nodes.

These nodes can be classified as belonging to a certain research perspective (like nodes expressing the materiality of a document, or nodes expressing linguistic information) or they identify which user has added the markup (like all markup added by Elli, or all markup added by Frederike). 

Layers are hierarchical: the nodes within one layer are hierarchically ordered. 

Layers may share markup nodes and textual nodes.

In other words: layers may overlap, but within one layer there can be no overlapping markup. This feature touches upon the discussion of containment and dominance, but we'll get to that when we have time. The layers in TAG are similar to Multi-Colored Trees (MCT) and XCONCUR, except that 

1. the textual content can change.
2. the layers can start locally (in contrast to XCONCUR where you can indicate that an element belongs to multiple hierarchies but these are entire trees)
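The rule "overlap between layers, hierarchy within a layer" can be expressed as a small check. The Python sketch below is an illustration under assumptions, not Alexandria code: markup is reduced to ranges over line numbers, and the `Markup` type, the offsets and the function names are invented for this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Markup(NamedTuple):
    name: str
    layers: frozenset  # a markup node may be shared by several layers
    start: int         # first line covered
    end: int           # one past the last line covered


def partially_overlap(a, b):
    """True if the ranges cross each other without one containing the other."""
    first, second = sorted((a, b), key=lambda m: (m.start, -m.end))
    return first.start < second.start < first.end < second.end


def violations(markup):
    """List (layer, name, name) for markup that overlaps within one layer."""
    by_layer = defaultdict(list)
    for m in markup:
        for layer in m.layers:
            by_layer[layer].append(m)
    return [
        (layer, a.name, b.name)
        for layer, members in sorted(by_layer.items())
        for a, b in combinations(members, 2)
        if partially_overlap(a, b)
    ]


layered = [
    Markup("page", frozenset({"L1"}), 0, 3),
    Markup("page", frozenset({"L1"}), 3, 5),
    Markup("p", frozenset({"L2"}), 1, 5),           # crosses the page break
    Markup("line", frozenset({"L1", "L2"}), 1, 2),  # shared by both layers
]
flattened = [m._replace(layers=frozenset({"L1"})) for m in layered]

layered_problems = violations(layered)      # []
flattened_problems = violations(flattened)  # [("L1", "page", "p")]
```

With two layers, the paragraph and the pages each sit in their own hierarchy; forced into a single layer, the same markup produces an overlap.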

In the second part of this lecture, I'll give a demo to show how layers work, but here you can already see some examples in fragments of a TAGML transcription:

## Astrid
```
[TAGML>
[page>
[p>
[line>2d. Voice from the Springs<line]
[line>Thrice three hundred thousand years<line]
[line>We had been stained with bitter blood<line]
<p]
<page]
[page>
[p>
[line>And had ran mute 'mid shrieks of slaughter<line]
[line>Thro' a city and a multitude<line]
<p]
<page]
<TAGML]
```

## Astrid
```
[TAGML>
[page|+L1>
[p|+L2>
[line>2d. Voice from the Springs<line]
[line>Thrice three hundred thousand years<line]
[line>We had been stained with bitter blood<line]
<page|L1]
[page|L1>
[line>And had ran mute 'mid shrieks of slaughter<line]
[line>Thro' a city and a multitude<line]
<p|L2]
<page|L1]
<TAGML]
```

[figure of MCT with layers]

## Bram
```
[TAGML>
[page|+A,+L1>
[poem|+B>
[p|A,+L2>
[sp|B>[line|A>[l|B>2d. Voice from the Springs<l]<line]
[line|A>[l|B>Thrice three hundred thousand years<l]<line]
[line|A>[l|B>We had been stained with bitter blood<l]<line]
<page|L1]
[page|A,L1>
[line|A>[l|B>And had ran mute 'mid shrieks of slaughter<l]<line]
[line|A>[l|B>Thro' a city & a multitude<l]<line]
<p|L2]
<sp]
<poem]
<page]
<TAGML]
```


[Figure of MCT with layers] 

Working with layers has a number of important implications.

## Implications

- abandoning a shared conception of digital editing in favor of multiple perspectives
- coexisting views, not one view leading
- increased readability _and_ reusability
- documentation: clearly communicate what information is in a file

With regard to the readability and reusability: this is the case when filtering out one or more layers, e.g., "show me only the markup with the layer ID [A]". With regard to the documentation: we're working on generating documentation to a certain extent.
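A request like "show me only the markup with the layer ID [A]" could look roughly like the Python sketch below. It is a toy filter, not the Alexandria implementation and not a TAGML parser. It assumes a simplified TAGML-like notation in which text contains no `[` or `<`, and in which a closing tag belongs to the most recently opened tag with the same name. The regular expression `TAG` and the function `keep_layer` are made up for this example.

```python
import re

# Matches an opening tag such as [line|A> or a closing tag such as <line].
TAG = re.compile(r"\[(\w+)(?:\|([^>\]]*))?>|<(\w+)(?:\|([^>\]]*))?\]")


def keep_layer(tagml, layer):
    """Return the text with only the markup that belongs to `layer`."""
    open_layers = {}  # tag name -> stack of layer sets of currently open tags

    def replace(match):
        open_name, open_suffix, close_name, _ = match.groups()
        if open_name:
            parts = (open_suffix or "").split(",")
            layers = {part.lstrip("+") for part in parts if part}
            open_layers.setdefault(open_name, []).append(layers)
            return f"[{open_name}>" if layer in layers else ""
        # A closing tag takes the layers of the latest opener with that name.
        stack = open_layers.get(close_name)
        layers = stack.pop() if stack else set()
        return f"<{close_name}]" if layer in layers else ""

    return TAG.sub(replace, tagml)


sample = (
    "[page|+A>[poem|+B>[line|A>[l|B>"
    "Thro' a city & a multitude"
    "<l]<line]<poem]<page]"
)

view_a = keep_layer(sample, "A")
view_b = keep_layer(sample, "B")
```

Each view is a plain, single-hierarchy transcription, which is where the gain in readability and reusability comes from.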

We have come to the third and final "desideratum" or required component: a platform-independent system to work with TAG files containing multiple layers.

# Distributed system

First, let's talk about the system.

You may imagine (or not, but then let me point it out to you) that a workflow with layers is going to be quite interesting. For starters:
- it implies a comparison between two marked-up files, finding changes in markup, text or both. 
We have developed a versioning tool that is able to do that (although it requires more testing), so the challenge is more conceptual (or philosophical) than technical.

It comes down to questions like:

What defines "change"? How do you identify a new version? 

Do you want to label the changes as additions (keeping the first version) or as substitutions (discarding the first version)? If I change Astrid's `[line>` markup to `[s>`, should we keep the `[line>` information as well?

Is a perspective only markup, or also text? 

Intuitively we'd say that a perspective consists of text _and_ markup: in one perspective, a transcriber can identify a typo with `[sic> [corr>` while another transcriber may choose to discard the typo and silently correct it. As a result, one perspective contains both `slaugter` and `slaughter` while the other perspective only contains `slaughter`.
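To see what "finding changes in markup, text or both" amounts to at its simplest, here is a Python sketch using the standard `difflib` module. It is not how the versioning tool mentioned above works; the token lists and the helper `changes` are assumptions for this illustration, and text and markup are compared as two separate sequences.

```python
from difflib import SequenceMatcher


def changes(old, new):
    """Return the non-equal edit operations that turn `old` into `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Text of two perspectives: one keeps the typo, one silently corrects it.
diplomatic = "And had ran mute 'mid shrieks of slaugter".split()
corrected = "And had ran mute 'mid shrieks of slaughter".split()

# Markup of two perspectives: [line> has been changed to [s>.
markup_first = ["page", "p", "line", "line"]
markup_second = ["page", "p", "s", "s"]

text_changes = changes(diplomatic, corrected)
markup_changes = changes(markup_first, markup_second)
```

The algorithm reports both differences as a "replace". Whether that should be stored as a substitution or as an addition that keeps the earlier reading is exactly the conceptual question raised above, and no diff algorithm answers it for us.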

Still, there are no ready answers to these and similar questions. On the contrary, there may well be opposing views, but as I've just argued, those lead to innovation, so that's fine. We just need more testing, more people working with TAG, and profound philosophical discussions about markup, modeling and text. We've developed a work environment to support just this: testing and experimenting.

The adjective "distributed" recalls Joris's remarks that I mentioned earlier. It is based on the truism that if you want a large community, one that has developed its own set of preferred methods, tools and convictions, to drop all that and adopt your does-it-all tool, you're gonna fail. _Alexandria_ is therefore implemented as a command-line tool without a graphical interface. It's a skeleton, a basic structure for which you can develop your own interface, if you please, or which you can integrate in an existing environment of choice. It is not dependent on a platform or an operating system and fits within a distributed architecture.

_Alexandria_ works similarly to git and the concurrent XML workflow, but in this case you can edit both text nodes and markup nodes.

<img align="center" src="images/workflow-alexandria_v0.4.png">

<img align="center" src="images/workflow-alexandria-remote-repo_v.2.png">

[A brief live demo of Alexandria, using the same transcriptions as earlier for consistency]

"There is also a graveyard somewhere for scholarly transcription environments... In the case of transcription tools, a defining trope would be that the tool was built as an _integrated transcription environment_"

With Alexandria, in short, we do _not_ provide an integrated transcription environment that will, in due time, find its way to its friends in the graveyard, but we provide a flexible tool that can be integrated in an editorial workflow.

Not coincidentally, _Alexandria_ is created by two developers who were also part of the _Interedition_ project (2007-2013) that can be credited with tools like CollateX and StemmaWeb among others. _Alexandria_ is an open source software component that allows for customisation.

Using _Alexandria_ and the TAG data model does have some epistemological consequences for our understanding of a version, of text, of information produced by scholars, etc. I'll just mention some, without necessarily answering them. Food for thought and a basis for the discussion.

- Can we represent information about text so that others can meaningfully interact with it?
- What does it mean when our textual models are _conceptually_ no longer limited by a particular format or structure?
- Where do we stand regarding ambiguity? 

The layer-feature of TAG allows you to express ambiguous information about a text. On the other hand, there's significant value in disambiguating the tags used for each layer, for instance by linking them to a sort of ontology that formalises the semantics of your tag set. This facilitates processing, querying and reusing. Now that I mention "ontology" we may also have to discuss semantics in relation to TAG. If there's time...
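What "linking tags to a sort of ontology" might look like in its most minimal form is sketched below in Python. Everything in it is hypothetical: the `example.org` URIs, the table `TAG_SEMANTICS` and the two helper functions are placeholders for this illustration and do not refer to an existing ontology or to a feature of TAG.

```python
# (layer, tag) -> URI of the concept the tag stands for; all URIs are made up.
TAG_SEMANTICS = {
    ("A", "line"): "https://example.org/ontology/document#PhysicalLine",
    ("B", "l"): "https://example.org/ontology/poetry#VerseLine",
    ("B", "sp"): "https://example.org/ontology/drama#Speech",
}


def concept_for(layer, tag):
    """Look up the formal meaning of a tag in a layer; None means undocumented."""
    return TAG_SEMANTICS.get((layer, tag))


def same_concept(first, second):
    """True only if both (layer, tag) pairs are documented and mean the same."""
    a, b = concept_for(*first), concept_for(*second)
    return a is not None and a == b


physical_line = concept_for("A", "line")
lines_equivalent = same_concept(("A", "line"), ("B", "l"))  # False
undocumented = concept_for("A", "stanza")                   # None
```

Two tags that look alike to a human reader, such as `line` and `l`, are then explicitly different things to a machine, and an undocumented tag shows up as a gap instead of being silently guessed at.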

# In conclusion

With the TAG data model and the reference implementation _Alexandria_ we've taken some significant steps towards that holy grail of computer science: expressing the natural world algorithmically so that the computer - that imbecile - can understand and be an even better instrument for scholarly research. A while ago, Wendell Piez said of his LMNL data model that

"both the analysis and the representation of literary texts are enabled in new ways by being able to overlap freely in the model"

We can paraphrase that, saying

Both the analysis and representation of, and the collaboration on literary texts are enabled in new ways by being able to natively express complex textual features in the model.

I want to emphasise here that we shouldn't pass lightly over the value of existing models and data structures. RDF and XML have much going for them:

- It's easy to start right away; there are tutorials all over the web
- There's never enough time and money; if you're on a tight budget it can make sense to cut technical training of editors or to steer clear of experimenting
- Not everyone has revolutionary ambitions; sometimes all you want to do is publish fast and cheap.

These are solid reasons and we don't argue against them. All of it is possible in TAG, and perhaps even easier: we can follow the TEI encoding standard but avoid the complicated workarounds when you encounter forms of overlap.  

Furthermore, the TAG model entails a strict separation between responsibilities (encoding - schema - semantics), outsourcing certain responsibilities to the application layer which makes it easier to process.

Again, I cite Piez:

"The primary goal of text encoding in the humanities should not be to conform to standards ... Rather, we encode texts and represent them digitally in order to present, examine, study, and reflect on the rich heritage of knowledge and expression presented to us in our cultural legacy"

The combination of TAG and _Alexandria_ provides us with a powerful modelling tool: 

- multiple coexisting views
- inclusive data model
- modular, open source software

Even though we are in the midst of development, these features already influence how we model, think about, process and analyse text. The abstract objective of the ancient Mouseion, "to unite all scholarly knowledge", has become a concrete and even attainable goal.

I'll leave on two notes:

## Implication for Archival Studies

- the content of archives is always curated; a process of selection and exclusion
- narratives in archives are a double-edged sword; they can both highlight _and_ obscure objects
- the utter arbitrariness of curators; a collection or exposition can convey a false sense of scholarly accuracy and completeness

# Textual Awareness

Including humanist scholars in the process of text modeling is unavoidable; excluding them undesirable. Dirk van Hulle coined the term "textual awareness" to describe the understanding of how a text is made (from a textual genetics point of view); I think it apt to expand that definition by including an awareness of the process of data transformations.

- the understanding of how a text is made
- an awareness of the process of data transformations

This necessitates a different focus in the education of humanities scholars, which should at least include information modeling and background knowledge of different data structures. As William Kent wrote:

"There is a flow of ideas from mind to mind; there are translations along the way, from concept to natural language to formal language and back again..."

We should appreciate the productivity of scholarly discourse and, instead of striving towards objective descriptions of our ideas, we should formalise our interpretations.

TAG and _Alexandria_ provide us with new, unprecedented ways to do just that.

## Credits

- Research and Development team (Ronald, Bram, Astrid)


## References
- Balisage papers
- XML Prague paper

**[Extra slides]**

## Containment versus Dominance

## Data typing

## Semantics