# Addressing Ancient Promises
## _Text Modeling and Alexandria_

Elli Bleeker  
_Research and Development - Humanities Cluster, Royal Academy of Arts and Sciences_  
2 November 2018

In _The Call of Cthulhu_ (1926), science fiction writer H.P. Lovecraft wrote:

_"The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and safety of a new dark age."_

Despite the ominous picture painted here, humankind has had the desire to "piece together dissociated knowledge" for centuries. Around 290 BC, King Ptolemy the First or his son, King Ptolemy the Second (the record's not clear on that), ordered the founding of a research institute, which they called the "Mouseion" in honour of the Muses. Part of this research institute was the library of Alexandria, famous to this day, whose objective was to be a "universal library": they wanted to "collect all the books in the world" and, notably, "the writings of all men _as far as they were worth any attention_". Like all archival collections, the library of Alexandria was built on a number of selection criteria. (We'll get back to that later.)

    [image of library]

The librarians were collectors of information and created what you can see as the first centre of knowledge production. Their methods of examining copies and establishing "the best" form the basis of textual criticism. Scholars from all over the civilised world came to work in the Mouseion. They had considerable academic freedom and made avid use of the library's collection, which resulted in high-end scholarship.

The goal of the Mouseion and Alexandria was

"to unite all scholarly knowledge"

There was undeniably a political reasoning behind this objective: the ancient Egyptian rulers certainly understood the truth behind the idiom "knowledge is power" and were pretty rigorous in their methods of acquisition and their dissemination policy. But, politics aside, the combination of a vast library collection and an international research centre facilitated the production of more knowledge, and led to advances in the fields of natural sciences, geography, mathematics, literature, and philosophy. I think it is safe to assume that today's digital libraries and archives still have this noble ambition.

But what is that, exactly: _unite all scholarly knowledge_? Arguably, the least problematic word in that phrase is "scholarly", although what is deemed scholarly and what is not is also a matter of debate. The other words are even trickier. What does it mean, to "unite"? Does it suffice to bring together all books in one physical place? Books may stand side by side on a shelf, but that doesn't mean that their contents are united, let alone synthesised. What is knowledge, actually? And how do you know if you have it all (do you ever)?

With regard to that last question, it's interesting to look at the statement embodied by the Open World Assumption of the Semantic Web:

"No single agent or observer has complete knowledge... We admit that our knowledge of the world is incomplete."

It means that the absence of a piece of information doesn't make it false: something can be true even if we cannot prove it's true. We'll hear more about this assumption later.
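To make the difference concrete, here is a minimal sketch in Python. It is only an illustration: the tiny set of facts and the two function names are made up for this example and do not come from any Semantic Web library. Under a closed-world reading a missing statement counts as false; under an open-world reading it is simply unknown.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy knowledge base of (subject, predicate, object) statements.
KNOWN_FACTS = {
    ("Mouseion", "located_in", "Alexandria"),
    ("Ptolemy I", "ruled", "Egypt"),
}


def closed_world(triple: Triple) -> bool:
    """Closed-world reading: whatever is not recorded is taken to be false."""
    return triple in KNOWN_FACTS


def open_world(triple: Triple) -> Optional[bool]:
    """Open-world reading: a missing statement is unknown (None), not false."""
    return True if triple in KNOWN_FACTS else None


# A statement our toy knowledge base happens not to contain.
query = ("Ptolemy II", "founded", "Mouseion")

print(closed_world(query))  # absent, so treated as false
print(open_world(query))    # absent, so treated as unknown
```

The open-world function never answers "false" here, because this toy knowledge base has no way of stating that something is not the case; it can only say "yes" or "I don't know".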

Phrases like "unite scholarly knowledge" remind us of the visionaries from the early days of digital textual scholarship, who saw in the digital medium the way to "make information accessible for everyone to look at from different perspectives, across disciplinary divides", creating ever-expanding knowledge sites that transcend individual project "silos" and follow the linked data principles. And those of you working in digital textual scholarship will know that, despite our best intentions, we have not yet reached that goal.

In a recent review-turned-reflection on Mirador, my colleague Joris van Zundert notes that, in fact, these data silos still exist. He identifies at least two causes: limited financial resources (inter-institutional collaboration is more costly, both in time and in money, than a local solution) and the convenience of the developers, who incidentally _also_ have to deal with limited time and money. In short, even though _in principle_ most scholars would agree that

"... keeping tools and data locked in one place behind a one-size-fits-all interface would be in stark contrast to the heterogeneous demands and requirements scholars had for document resources that were spread, quite literally, across the planet..."

it remains our reality.

In this talk, I examine how digital text modeling can, potentially, fulfil the objective of the great research libraries of the past: to unite and disseminate scholarly knowledge. I identify three requirements (or let's call them "desiderata", as a salute to Peter Robinson):

1. an **inclusive data model** that 
    
    1.1. allows for advanced textual features 
    
    1.2. provides a way to formalise the meaning of markup; 


2. **support for multiple perspectives on a source text**;  


3. an editing and version management tool that is platform-independent.

The discussion of these desiderata serves primarily as the background for a presentation of Alexandria, a text repository system that is based on a new data model called TAG, both under development at the R&D group of the HuC.

After the usual definitions, I'll go in broad strokes over concepts like information modeling. I'm sure all of you are familiar with them, but it gives me the chance to abuse computers a bit while citing computer scientists.

Because these requirements are related to the principles of the Semantic Web, the Linked Open Data (LOD) framework, and the ideals of a distributed architecture, I will touch upon those topics as well, but only in passing: the main goal of my talk is to establish an informed discussion about information modeling, data structures and digital textual scholarship. And, of course, to present and promote our own _Alexandria_.

## Definitions

First, some definitions (which I'll probably use interchangeably anyway).

**Information**

Information-as-process  

Information-as-knowledge  

Information-as-thing

**Data**

"Literary texts cannot be reduced to data because they are too complex, very messy and never complete" 
(Marche 2012)

**Knowledge**

**Modeling**

By definition, 

a model is always a selection of the original

"We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this 'poverty' that makes it possible to handle them, and therefore to know"

Still, it has been considered

"the holy grail of computer science to algorithmically express the natural world"

Why is this so hard?

Because the natural world is ambiguous and complex; its products open to many different interpretations.

The computer, on the other hand, is rather dumb. Simplistic, if you will. William Kent described computer programming as

"... the art of getting an imbecile to play bridge or to fill out his tax returns by himself. It can be done, provided you know how to exploit the imbecile's limited talents, and are willing to have enormous patience with his inability to make the most trivial common sense decisions on his own."

"The first step toward understanding computers is an appreciation of their simplicity, not their complexity."

Still, despite this utter stupidity we're apparently dealing with on a day-to-day basis, we can already do lots of cool stuff with a computer when it comes to modeling complex documents. Clearly these achievements can be wholly and uniquely attributed to us, intelligent human beings!

This brings me to the first point, the first "desideratum" if you will:

## 1. A flexible and inclusive data model

We have at our disposal a number of different data models to express and represent text. I'll give a brief outline of the most common or extraordinary ones. Keep in mind the difference between a data model and a syntax! A syntax is a serialisation of a data model: XML, for example, is a serialisation (= a linearly ordered expression) of the OHCO model, and so is SGML.
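A small sketch may help to keep the two apart. It uses Python's standard `xml.etree.ElementTree` and `json` modules; the element names `l` and `hi`, the attribute `n`, and the helper `to_nested` are assumptions chosen for this illustration only. The point is that one and the same ordered hierarchy (the data model) can be written out in more than one syntax.

```python
import json
import xml.etree.ElementTree as ET

# The data model: an ordered hierarchy (a tree) that lives in memory.
# Element and attribute names are illustrative assumptions.
line = ET.Element("l", {"n": "1"})
line.text = "We live on a placid island of "
emphasis = ET.SubElement(line, "hi")
emphasis.text = "ignorance"


def to_nested(element: ET.Element) -> dict:
    """Express the same tree as nested dictionaries and lists."""
    return {
        "name": element.tag,
        "attributes": dict(element.attrib),
        "text": element.text or "",
        "tail": element.tail or "",
        "children": [to_nested(child) for child in element],
    }


# Two serialisations (syntaxes) of the one tree.
as_xml = ET.tostring(line, encoding="unicode")
as_json = json.dumps(to_nested(line), indent=2)

print(as_xml)
print(as_json)
```

Neither string is "the" text model; both are linear expressions of the same tree. Parsing either of them back gives a hierarchy with the same elements in the same order.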