# Theory of collation
## Gothenburg model

The Gothenburg model emerged from a 2009 meeting of the developers of CollateX and Juxta in Gothenburg, Sweden, at a joint workshop of the EU-funded research projects COST Action 32 and Interedition (http://wiki.tei-c.org/index.php/Textual_Variance). The model conceptualizes collation as a pipeline of five stages. The stages are modular, so that any one of them can be modified without requiring revision of the rest of the system:

1. Tokenization
1. Normalization/regularization
1. Alignment
1. Analysis
1. Visualization/output

The description of these five stages below presupposes that the texts to be collated have already been transcribed and digitized.
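
The modularity of the pipeline can be illustrated with a minimal Python sketch in which each stage is a separate function that can be swapped out. All function names here are hypothetical, normalization is assumed to mean lowercasing, and the alignment step uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not the alignment algorithm of CollateX or Juxta. The analysis and visualization stages are omitted.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# One row of the result: (reading in witness A, reading in witness B).
# An empty string marks a gap in that witness.
Row = Tuple[str, str]


def tokenize(text: str) -> List[str]:
    """Stage 1 (assumed policy): whitespace-delimited tokens."""
    return text.split()


def normalize(token: str) -> str:
    """Stage 2 (assumed policy): compare tokens case-insensitively."""
    return token.lower()


def align(a: List[str], b: List[str], norm: Callable[[str], str]) -> List[Row]:
    """Stage 3 (stand-in): align on normalized forms, report original forms."""
    keys_a = [norm(t) for t in a]
    keys_b = [norm(t) for t in b]
    matcher = SequenceMatcher(a=keys_a, b=keys_b, autojunk=False)
    rows: List[Row] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching tokens are paired one to one.
            rows.extend(zip(a[i1:i2], b[j1:j2]))
        else:
            # Variation, addition, or omission: one row per divergent run.
            rows.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return rows


def collate(
    a_text: str,
    b_text: str,
    tokenizer: Callable[[str], List[str]] = tokenize,
    normalizer: Callable[[str], str] = normalize,
) -> List[Row]:
    """Run the stages in order; any stage can be replaced by the caller."""
    return align(tokenizer(a_text), tokenizer(b_text), normalizer)


print(collate("The black cat", "the white cat"))
# [('The', 'the'), ('black', 'white'), ('cat', 'cat')]
```

Because `collate` accepts the tokenizer and normalizer as arguments, a different tokenization policy can be substituted without touching the alignment code, which is the kind of independence between stages that the model describes.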


### 1. Tokenization
*Tokenization* is the division of a continuous text into units to be aligned (called *tokens*). Most commonly, tokens are whitespace-delimited words, but tokenization can be performed at any level of granularity, e.g., “syllables, words, lines, phrases, verses, paragraphs, or text nodes” (http://wiki.tei-c.org/index.php/Textual_Variance). Challenges that arise during tokenization include the following:

* **Ambiguity.** In texts written without spaces between words (*scriptio continua*), the division into words may be ambiguous, that is, it may be possible to divide the same continuous writing in more than one way, each of which would be linguistically correct.
* **Punctuation.** Punctuation is commonly tokenized by itself, so that, for example, “cat” and “cat,” (without and with a trailing comma) will be recognized as instances of the same word. The situation is less clear with non-final punctuation, though, such as the hyphen in hyphenated words.
* **Contractions** like English “doesn’t” or “can’t” raise questions about whether they should be treated as one word or two for the purpose of collation.
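
The effect of these decisions can be seen in a short Python sketch that compares plain whitespace tokenization with a pattern that tokenizes every punctuation character by itself. The function names and the regular expression are illustrative assumptions, not part of any collation tool.

```python
import re
from typing import List

# A run of word characters, or a single character that is neither
# a word character nor whitespace (that is, a punctuation mark).
SIMPLE_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_whitespace(text: str) -> List[str]:
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def tokenize_punctuation(text: str) -> List[str]:
    """Treat every punctuation character as a token of its own."""
    return SIMPLE_PATTERN.findall(text)


print(tokenize_whitespace("The cat, the dog."))
# ['The', 'cat,', 'the', 'dog.']
print(tokenize_punctuation("The cat, the dog."))
# ['The', 'cat', ',', 'the', 'dog', '.']
print(tokenize_punctuation("well-known"))
# ['well', '-', 'known']
```

With whitespace tokenization, “cat,” and “cat” are different tokens; with the second function they match. The same pattern also breaks “well-known” into three tokens, and it would do the same to a contraction such as “doesn’t”, which may or may not be what the researcher wants.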

The preceding issues affect tokenization on an intellectual level in that they involve decisions by the researcher that would arise with or without a computational environment. Machine-assisted tokenization, though, raises additional challenges, including the following: 

* **Word-internal punctuation** means that it cannot safely be assumed that a word is a continuous sequence of alphabetic characters, and that a punctuation character indicates the beginning of a new token. In addition to the ambiguities involving hyphenation and English negative contractions, consider:
    * **Lexical contraction.** “Amsterdam” may be contracted as “A’dam”, a contraction that is lexically specific.
    * **Punctuation before bound morphemes.** The English “-’s” possessive is different from English negative contractions in “-n’t”. The “-n’t” portion might be understood as a variant spelling of “not” (insofar as “doesn’t” may be replaced by “does not”, etc., without violating English language norms), but the “-’s” possessive particle does not have a free-standing lexical counterpart in modern English.
* **Superscription.** In some scribal practice, such as Church Slavonic manuscripts, it is common to write some letters as superscripts, and the base and superscript letters may belong to different words. For example, “ona že” (‘and she’) may be written as “ona<sup>ž</sup>”. The visual form that appears in the manuscript isn’t easily reproduced here in a web interface, but the superscript “ž” is not merely raised, but also centered over the “a” that ends the preceding word.
* **Markup**, such as XML, may be intermingled with textual content. XML element tags may surround an entire word or part of a word, or they may begin inside one word and end inside another, which poses special challenges for tokenizing in a way that does not contradict XML well-formedness. Even where well-formedness is not an issue, researchers may differ in their preferences for taking markup into account when collating a set of witnesses.
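
Two of these challenges can be sketched together in Python. The example below assumes one possible policy among many: markup is discarded before tokenization (using `xml.etree.ElementTree` from the standard library), and an apostrophe or hyphen between word characters is treated as part of the word. The function name, the pattern, and the sample input are hypothetical. The sketch does not address superscription, and it loses all information carried by the markup.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# A word may contain internal apostrophes (straight ' or curly \u2019)
# or hyphens, as long as word characters follow. Any other punctuation
# character is a token by itself.
WORD_OR_PUNCT = re.compile(r"\w+(?:['\u2019-]\w+)*|[^\w\s]")


def tokens_from_xml(xml_string: str) -> List[str]:
    """Flatten XML to plain text, then tokenize (assumed policy)."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text content in document order, so a tag
    # that wraps only part of a word does not split the word.
    plain_text = "".join(root.itertext())
    return WORD_OR_PUNCT.findall(plain_text)


sample = "<p>A\u2019dam is the <hi>well</hi>-known cat\u2019s home, isn\u2019t it?</p>"
print(tokens_from_xml(sample))
```

Under these assumptions “A’dam”, “well-known”, “cat’s”, and “isn’t” each come out as a single token, even though the `<hi>` element ends in the middle of “well-known”. A researcher who wants to split negative contractions, or to keep markup as part of the token, would need a different policy.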
