<div align="center"><h1>Automated collation of nonlinear text</h1>
<h3>or: how to describe textual revisions to a computer</h3>
<br/>
<h4>Elli Bleeker, Bram Buitendijk, Ronald Haentjens Dekker
    <br/>R&amp;D group</h4>
    <h6>Huygens webinar – April 2020</h6>
</div>

## How do we, as humans, perceive and process information?

## How can we impose that view on data processing machines?

## What constructs and tactics do we use to cope with complexity, ambiguity, incomplete information, mismatched viewpoints, and conflicting objectives?

After William Kent, _Data and Reality_ (2002 [1978])

<img src="images/brulez-fragment-manuscript.png">

Source: xxx

<img src="images/brulez-transcript-ms.png">

`<tei>Het <subst><del>was heerlijk</del><add>zalig</add></subst> zich aldus eens te laten beleedigen...</tei>`

<img src="images/brulez-fragment-typoscript.png">

Source: xxx

<img src="images/brulez-transcript-ts.png">

`<tei>Zalig was het <subst><del>zich aldus te laten beleedigen</del><add>in die stortbui van beleedigingen te staan; heerlijk om</add></subst>...</tei>`

### Revision layers

**Manuscript**
1. Het was heerlijk zich aldus eens te laten beleedigen
2. Het <del>was heerlijk</del> zich aldus eens te laten beleedigen
3. Het zalig zich aldus eens te laten beleedigen

**Typescript**
1. Zalig was het zich aldus te laten beleedigen
2. Zalig was het <del>zich aldus te laten beleedigen</del>
3. Zalig was het in die stortbui van beleedigingen te staan; heerlijk om ...

In [1]:
from collatex import *

In [2]:
collation = Collation()

In [3]:
collation.add_plain_witness( "A", "<tei>Het <subst><del>was heerlijk</del><add>zalig</add></subst> zich aldus eens te laten beleedigen...</tei>" )
collation.add_plain_witness( "B", "<tei>Zalig was het <subst><del>zich aldus te laten beleedigen</del><add>in die stortbui van beleedigingen te staan; heerlijk om</add></subst>...</tei>" )

In [10]:
alignment_table = collate(collation, layout='vertical', segmentation=True)

In [5]:
print(alignment_table)

+----------------------+---------------------+
|          A           |          B          |
+----------------------+---------------------+
|        <tei>         |        <tei>        |
+----------------------+---------------------+
|         Het          |    Zalig was het    |
+----------------------+---------------------+
|     <subst><del>     |     <subst><del>    |
+----------------------+---------------------+
|     was heerlijk     | zich aldus te laten |
|                      |      beleedigen     |
+----------------------+---------------------+
|     </del><add>      |     </del><add>     |
+----------------------+---------------------+
|        zalig         | in die stortbui van |
|                      |   beleedigingen te  |
|                      |  staan; heerlijk om |
+----------------------+---------------------+
|    </add></subst     |    </add></subst    |
+----------------------+---------------------+
| > zich aldus eens te |        >...</       |
|        late

Simply collating the XML/TEI transcriptions is clearly not going to get us far. So should we select one writing stage per witness? But which stage do we select? And would that choice hold up throughout the text?

**Manuscript**
1. Het was heerlijk zich aldus eens te laten beleedigen
2. Het <del>was heerlijk</del> zich aldus eens te laten beleedigen
3. Het zalig zich aldus eens te laten beleedigen

**Typescript**
1. Zalig was het zich aldus te laten beleedigen
2. Zalig was het <del>zich aldus te laten beleedigen</del>
3. Zalig was het in die stortbui van beleedigingen te staan; heerlijk om ...
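To make the idea of "selecting a writing stage" concrete, here is a minimal sketch that derives the first and the last layer listed above from the TEI transcription. It only uses Python's standard library; the function name `writing_stage` and its `drop` parameter are our own illustration and not part of CollateX. It assumes revisions are encoded solely with `<del>` and `<add>`.

```python
import xml.etree.ElementTree as ET

def writing_stage(xml: str, drop: str) -> str:
    """Return the text of `xml` with every <drop> element left out (hypothetical helper)."""
    def walk(element):
        if element.tag == drop:
            return ""
        parts = [element.text or ""]
        for child in element:
            parts.append(walk(child))
            parts.append(child.tail or "")  # text that follows the child element
        return "".join(parts)
    # Collapse the runs of whitespace left behind by removed elements.
    return " ".join(walk(ET.fromstring(xml)).split())

manuscript = ("<tei>Het <subst><del>was heerlijk</del><add>zalig</add></subst> "
              "zich aldus eens te laten beleedigen...</tei>")

before = writing_stage(manuscript, drop="add")  # keep deletions, ignore additions
after = writing_stage(manuscript, drop="del")   # keep additions, ignore deletions
print(before)
print(after)
```

Either string could be handed to a collation tool as a plain witness, but each choice discards the other stage, which is exactly the dilemma raised above.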

## Research question: 

### how can we align our (human) understanding of textual revisions with the computer's, in order to improve the collation output?

## Part 1. Structures of digital text
## Part 2. Automated collation and XML documents
## Part 3. R&D: Research is always under Development

<div align="center"><h1>Part 1: Structures of digital text</h1></div>

<img src="data-marijke/min-ez-1954-fragment.png">

Source: xxx

<img src="data-marijke/transcript-fragment.png">

... `<tei>een <del>voldoende betrouwbare</del> <add>zekere</add> basis</tei>` ...

`<tei>een <del>voldoende betrouwbare</del> <add>zekere</add> basis</tei>`

<img src="data-marijke/viz-tree-fragment.png">

### Computer: <img src="data-marijke/viz-tree-fragment.png">

### Collation tool: 
<br/>
<br/><img src="data-marijke/viz-tokens-fragment.png">

There are ways (using CollateX JSON input) to pass information about the revisions along the collation pipeline, but the alignment itself is still carried out on the flat sequence of tokens.
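As a rough illustration of that JSON route, the sketch below turns the fragment into a token list in which every token remembers whether it sits inside a `<del>` or an `<add>`. The `witnesses`, `tokens`, `t` and `n` keys follow the CollateX JSON input format; the `revision` key, the witness id `TS` and the helper name are assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

def tokens_with_revisions(xml):
    """Tokenize on whitespace and record the enclosing <del>/<add>, if any (hypothetical helper)."""
    tokens = []

    def emit(text, revision):
        for word in (text or "").split():
            # "revision" is an extra, made-up property carried alongside "t" and "n".
            tokens.append({"t": word, "n": word.lower(), "revision": revision})

    def walk(element, revision):
        current = element.tag if element.tag in ("del", "add") else revision
        emit(element.text, current)
        for child in element:
            walk(child, current)
            emit(child.tail, current)  # the tail belongs to the parent's context

    walk(ET.fromstring(xml), None)
    return tokens

fragment = "<tei>een <del>voldoende betrouwbare</del> <add>zekere</add> basis</tei>"
tokens = tokens_with_revisions(fragment)
json_input = {"witnesses": [{"id": "TS", "tokens": tokens}]}
print(json.dumps(json_input, indent=2))
```

The extra property travels with each token and can be used when visualising the output, but the aligner still sees one linear run of five tokens.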

<img width="1200" height="1200" src="images/user-comp-tool.png">

As a result of flattening the typescript's text, the deletion and the addition become one sequence of text tokens: the history of the text is erased. If, like Marijke, you are interested in precisely this succession of textual changes, that is a significant loss of information.
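The flattening can be reproduced in a few lines. This sketch uses only the standard library and a plain whitespace split as a stand-in for a collation tool's tokenizer, which is an assumption on our part.

```python
import xml.etree.ElementTree as ET

fragment = "<tei>een <del>voldoende betrouwbare</del> <add>zekere</add> basis</tei>"

# itertext() yields all text content in document order and ignores the markup.
flattened = "".join(ET.fromstring(fragment).itertext())
flat_tokens = flattened.split()
print(flat_tokens)
```

Nothing in the resulting list tells us that "voldoende betrouwbare" was deleted and "zekere" was added in its place.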

## A matter of translation:


### What is the human understanding of linear and nonlinear text?

## Linear text:

<img src="data-marijke/min-ez-1954-fragment2.png">

### XML transcription
`<tei>Indien met het vorenvermelde rekening wordt gehouden</tei>`

### Tree structure: <img width="500" height="500" align="center" src="data-marijke/viz-tree-fragment2.png">

### Token sequence: 

<br/>
<img width="500" height="500" src="data-marijke/viz-tokens-fragment2.png">

## Nonlinear text:

<img src="data-evina/Isidorus-f9r-fragment.png">

Source: Zofingen, Stadtbibliothek / Pa 32 – Isidorus Hispalensis, *Etymologiarum sive originum libri; De natura rerum*, f.9r (with many thanks to Evina Steinova).

Transcription: Concurrentibus enim (in se *add. sup. lin.*) invicem alfa ad W (usque devolvitur *add. in marg.*)

TEI/XML: `<tei>Concurrentibus enim <add place="above">in se</add> invicem alfa ad W <add place="margin">usque devolvitur</add></tei>`

As a human reader, I would say: this text can be read in several orders. In other words, there is a point in the text where the words no longer form a linear sequence. I may be biased(!), but I consider a variant graph a good representation of the two ways of reading this text:

<img src="data-evina/variant-graph-isidorus-f9r.png">
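For readers who prefer code to pictures, here is a hand-built sketch of such a graph. It is not the graph in the figure: the node grouping, the layer names `original` and `revised`, and the assumption that both additions belong to the same revision are ours. Each edge carries the set of layers that pass through it, and a reading is recovered by following one layer from start to end.

```python
# Each node maps to its outgoing edges: (target node, layers that traverse the edge).
EDGES = {
    "START": [("Concurrentibus enim", {"original", "revised"})],
    "Concurrentibus enim": [
        ("in se", {"revised"}),
        ("invicem alfa ad W", {"original"}),
    ],
    "in se": [("invicem alfa ad W", {"revised"})],
    "invicem alfa ad W": [
        ("usque devolvitur", {"revised"}),
        ("END", {"original"}),
    ],
    "usque devolvitur": [("END", {"revised"})],
}

def reading(edges, layer):
    """Follow the edges labelled with `layer` from START to END."""
    node, words = "START", []
    while node != "END":
        node = next(target for target, layers in edges[node] if layer in layers)
        if node != "END":
            words.append(node)
    return " ".join(words)

original = reading(EDGES, "original")
revised = reading(EDGES, "revised")
print(original)
print(revised)
```

The shared nodes appear only once in the graph, while the points where the two layers diverge are exactly the places where the text stops being linear.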

## Our definition of text:

### Text is a multilayered, nonlinear construct containing information that is at times ordered, unordered, and partially ordered.

<div align="center"><h1>2. Automated collation and XML</h1></div>

## Current approaches to collating nonlinear text

### 1. Selecting one writing stage as a witness

### 2. Creating pseudo-witnesses of the different writing stages

### 3. Using the JSON input to pass along relevant markup information
- Only with CollateX
- The information is *not* used during the alignment, but can be used to visualise the output

### 4. Using a diff algorithm (a delta) to compare the XML documents
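As a very rough impression of the fourth approach, the sketch below runs Python's `difflib.SequenceMatcher` over the two transcriptions, with tags and words treated as tokens alike. This is a stand-in chosen for illustration: it is a flat sequence diff, not a tree-aware XML diff, and the tokenizing regular expression is our own assumption.

```python
import re
from difflib import SequenceMatcher

def tokenize(xml):
    # Tags and whitespace-separated words both become tokens.
    return re.findall(r"<[^>]+>|[^<\s]+", xml)

witness_a = ("<tei>Het <subst><del>was heerlijk</del><add>zalig</add></subst> "
             "zich aldus eens te laten beleedigen...</tei>")
witness_b = ("<tei>Zalig was het <subst><del>zich aldus te laten beleedigen</del>"
             "<add>in die stortbui van beleedigingen te staan; heerlijk om</add>"
             "</subst>...</tei>")

a, b = tokenize(witness_a), tokenize(witness_b)
opcodes = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()

# Each opcode says how a slice of A relates to a slice of B.
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8} {' '.join(a[i1:i2])!r} -> {' '.join(b[j1:j2])!r}")
```

Because the markup is compared as ordinary tokens, matching tags can end up anchoring the result regardless of what the text inside them says.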

## HyperCollate

**[Switch to HyperCollate notebook]**

<div align="center"><h1>3. Research under Development</h1></div>

## Ongoing issues: 

### Open variants
[example]

### Visualisations of the output
[example]

### (Insignificant) whitespace between elements representing revisions
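A small sketch of how such whitespace can be spotted in a transcription, using only the standard library. The helper name is hypothetical, and it only looks at a `<del>` immediately followed by an `<add>`; whether a given space is actually insignificant remains an editorial decision.

```python
import xml.etree.ElementTree as ET

def whitespace_between_revisions(xml):
    """Return (deleted, added) text pairs separated only by whitespace (hypothetical helper)."""
    hits = []
    for parent in ET.fromstring(xml).iter():
        children = list(parent)
        for left, right in zip(children, children[1:]):
            only_space = bool(left.tail) and not left.tail.strip()
            if left.tag == "del" and right.tag == "add" and only_space:
                hits.append((left.text, right.text))
    return hits

with_space = "<tei>een <del>voldoende betrouwbare</del> <add>zekere</add> basis</tei>"
without_space = ("<tei>Het <subst><del>was heerlijk</del><add>zalig</add></subst> "
                 "zich aldus eens te laten beleedigen...</tei>")

print(whitespace_between_revisions(with_space))
print(whitespace_between_revisions(without_space))
```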

### Transpositions

### Immediate corrections (*Sofortkorrektur*, revision *currente calamo*)