# Collation tools

Tools commonly used for the machine-assisted collation and visualization of witnesses with variation include the following:

* Collate
* diff/wdiff
* TUSTEP
* NMerge
* TEIComparator
* Juxta
* Versioning machine
* TRAViz
* CollateX

Where we were able, we’ve used the tools to collate or compare the first paragraph of Charles Darwin’s *Origin of species* as it appears in the six editions published during his lifetime. All tools listed below except the original Collate are open source.

## Collate
<img src="images/collate_screenshot.jpg"/>
**Collate** (http://www.hd.uib.no/humdata/2-91/robin.htm) was an automated collation tool that ran on early Macintosh computers. It is no longer distributed or maintained, but it is historically important as the first tool designed specifically to perform the sort of alignment, collation, and reporting undertaken by textual scholars in the preparation of critical editions.

## diff/wdiff
**diff** is a command-line (non-graphical) utility that compares two text files line by line, locates differences, and produces output that summarizes the line-level insertions and deletions that can be used to convert one of the files to the other programmatically. It is not strictly a collation and alignment tool, but it does identify lines in two files that differ textually, and in that way has some of the same functionality as full witness collation. **diff** is intended primarily for line-level comparisons in genres characterized by short lines, such as computer programs, where if the computer can find the lines that differ, the human can easily spot the difference. In Digital Humanities work, it is common for an entire paragraph to be encoded as a single line (that is, without any internal newline characters), which means that **diff** will report that the lines differ, but won’t identify the moments of difference within them. Here is the output of comparing two excerpts from different editions of Charles Darwin’s *Origin of species*. It shows that one entire paragraph (= line) must be removed from one witness and a different one inserted in its place to make them agree, without reporting on the moments of variation, which are actually rather few:

    1c1
    < WHEN we look to the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us, is, that they generally differ much more from each other, than do the individuals of any one species or variety in a state of nature. When we reflect on the vast diversity of the plants and animals which have been cultivated, and which have varied during all ages under the most different climates and treatment, I think we are driven to conclude that this greater variability is simply due to our domestic productions having been raised under conditions of life not so uniform as, and somewhat different from, those to which the parent-species have been exposed under nature. There is, also, I think, some probability in the view propounded by Andrew Knight, that this variability may be partly connected with excess of food. It seems pretty clear that organic beings must be exposed during several generations to the new conditions of life to cause any appreciable amount of variation; and that when the organisation has once begun to vary, it generally continues to vary for many generations. No case is on record of a variable being ceasing to be variable under cultivation. Our oldest cultivated plants, such as wheat, still often yield new varieties: our oldest domesticated animals are still capable of rapid improvement or modification.
    ---
    > Causes of Variability. WHEN we compare the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes us is, that they generally differ more from each other than do the individuals of any one species or variety in a state of nature. And if we reflect on the vast diversity of the plants and animals which have been cultivated, and which have varied during all ages under the most different climates and treatment, we are driven to conclude that this great variability is due to our domestic productions having been raised under conditions of life not so uniform as, and somewhat different from, those to which the parent-species had been exposed under nature. There is, also, some probability in the view propounded by Andrew Knight, that this variability may be partly connected with excess of food. It seems clear that organic beings must be exposed during several generations to new conditions to cause any great amount of variation; and that, when the organisation has once begun to vary, it generally continues varying for many generations. No case is on record of a variable organism ceasing to vary under cultivation. Our oldest cultivated plants, such as wheat, still yield new varieties: our oldest domesticated animals are still capable of rapid improvement or modification.
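
The same line-level behavior can be reproduced with Python’s standard `difflib` module. This is only a minimal sketch, not the **diff** program itself: the witness texts are shortened excerpts of the paragraphs above, the file names are hypothetical labels, and `difflib` emits unified format rather than the `1c1` style shown above. The point is that a one-word change still causes the whole line to be reported as removed and re-inserted.

```python
import difflib

# Two witnesses, each holding one paragraph as a single line
# (shortened excerpts of the Darwin text quoted above).
witness_a = ["WHEN we look to the individuals of the same variety or sub-variety"]
witness_b = ["WHEN we compare the individuals of the same variety or sub-variety"]

# "witness_a.txt" and "witness_b.txt" are hypothetical labels for the report header.
diff_lines = list(
    difflib.unified_diff(
        witness_a,
        witness_b,
        fromfile="witness_a.txt",
        tofile="witness_b.txt",
        lineterm="",
    )
)

# The report contains the entire old line (prefixed "-") and the
# entire new line (prefixed "+"), with no word-level detail.
for line in diff_lines:
    print(line)
```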

“The GNU **wdiff** program is a front end to **diff** for comparing files on a word per word basis. A word is anything between whitespace. This is useful for comparing two texts in which a few words have been changed and for which paragraphs have been refilled. It works by creating two temporary files, one word per line, and then executes diff on these files. It collects the diff output and uses it to produce a nicer display of word differences between the original files.” (https://www.gnu.org/software/wdiff/) **wdiff** thus expands on the line-level orientation of **diff** to provide a word-level perspective, and it outputs a consolidated view that makes it easy to distinguish the places of agreement from the places of variation within a line:

    {+Causes of Variability.+} WHEN we [-look to-] {+compare+} the individuals of the same variety or sub-variety of our older cultivated plants and animals, one of the first points which strikes [-us,-] {+us+} is, that they generally differ [-much-] more from each [-other,-] {+other+} than do the individuals of any one species or variety in a state of nature. [-When-] {+And if+} we reflect on the vast diversity of the plants and animals which have been cultivated, and which have varied during all ages under the most different climates and treatment, [-I think-] we are driven to conclude that this [-greater-] {+great+} variability is [-simply-] due to our domestic productions having been raised under conditions of life not so uniform as, and somewhat different from, those to which the parent-species [-have-] {+had+} been exposed under nature. There is, also, [-I think,-] some probability in the view propounded by Andrew Knight, that this variability may be partly connected with excess of food. It seems [-pretty-] clear that organic beings must be exposed during several generations to [-the-] new conditions [-of life-] to cause any [-appreciable-] {+great+} amount of variation; and [-that-] {+that,+} when the organisation has once begun to vary, it generally continues [-to vary-] {+varying+} for many generations. No case is on record of a variable [-being-] {+organism+} ceasing to [-be variable-] {+vary+} under cultivation. Our oldest cultivated plants, such as wheat, still [-often-] yield new varieties: our oldest domesticated animals are still capable of rapid improvement or modification.
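
A word-level comparison of this kind can be approximated in a few lines of Python. The sketch below is an illustration under stated assumptions, not the way **wdiff** is implemented: it splits on whitespace as **wdiff** does, but it aligns the words with `difflib.SequenceMatcher` instead of running **diff** on temporary files, so its alignment may differ from that of **wdiff** on other inputs. The function name `word_diff` is our own.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Render word-level differences using wdiff-style markers."""
    old_words = old.split()  # a word is anything between whitespace
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
            continue
        if tag in ("delete", "replace"):
            # words present only in the first witness
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            # words present only in the second witness
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


old = "WHEN we look to the individuals of the same variety"
new = "Causes of Variability. WHEN we compare the individuals of the same variety"
result = word_diff(old, new)
print(result)
# {+Causes of Variability.+} WHEN we [-look to-] {+compare+} the individuals of the same variety
```

Note that the original spacing is not preserved, because the words are rejoined with single spaces.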

While the output of **wdiff** is closer to the way humanists are used to viewing textual variation than the output of **diff**, both approaches are limited by their ability to compare only two witnesses. There are graphic front ends for **diff** that highlight the differences, and there is an **mdiff** utility that promises to support the comparison of multiple witnesses, but it currently supports word-level reporting only for two witnesses. Tools based on **diff** tend to have only limited support for modifying default tokenization and normalization (e.g., **diff** can be told to ignore differences between upper and lower case). License: GPL.

## TUSTEP
TUSTEP (the TUebingen System of Text Processing Programs, http://www.tustep.uni-tuebingen.de/tustep_eng.html) was developed to support the processing of textual data for scholarly research. The documentation describes the output of the COLLATE feature of TUSTEP as “a listing of the differences between one or several text versions and a basic printing in lines synoptic to the basic text. The differences must be provided in form of correcting instructions containing a correction key, as generated by the TUSTEP programs COMPARE or PRESORT.” (http://www.tustep.uni-tuebingen.de/pdf/hdb93_eng.pdf, p. 66) The English-language documentation does not appear to have kept pace with the German. TXSTEP (http://www.txstep.de/) is an XML-oriented front end to TUSTEP. License (both TUSTEP and TXSTEP): revised BSD.

## NMerge
https://code.google.com/archive/p/multiversiondocs/wiki/NMerge

## TEIComparator
http://tei-comparator.sourceforge.net/

## Juxta
<img src="images/juxta_screenshot.png"/>
Juxta (http://www.juxtasoftware.org/) supports the collation, comparison, and visualization of textual variation (in plain text or XML sources) as either a desktop application or a web service. Visualizations include a heat map (image above), histogram, side-by-side synoptic view, TEI parallel segmentation, an experimental critical edition layout with apparatus criticus, and output for the Versioning machine (see below). License: Apache.

## Versioning machine
<img src="images/juxta_screenshot.png"/>
The Versioning machine (http://v-machine.org/) does not perform collation, but it takes texts whose variants have already been collated and tagged with TEI parallel segmentation or location referenced markup and displays the witnesses in a way that supports the identification, highlighting, and exploration of variation. The visualization above was captured from the Versioning machine output of Juxta after Juxta performed the collation and alignment. License: GPL.
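
To give a sense of what parallel segmentation markup looks like, the following sketch builds a small fragment with Python’s standard `xml.etree.ElementTree` module. It is illustrative only: the witness sigla `A` and `B` and the helper `make_app` are hypothetical, the TEI namespace and header are omitted for brevity, and a real document prepared for the Versioning machine would need both. Each place of variation becomes an `<app>` element holding one `<rdg>` per reading, while text shared by all witnesses stays outside it.

```python
import xml.etree.ElementTree as ET


def make_app(readings):
    """Build an <app> element from (text, [sigla]) pairs (hypothetical helper)."""
    app_el = ET.Element("app")
    for text, sigla in readings:
        # wit points at the witnesses that attest this reading
        rdg = ET.SubElement(app_el, "rdg", wit=" ".join("#" + s for s in sigla))
        rdg.text = text
    return app_el


paragraph = ET.Element("p")
paragraph.text = "WHEN we "  # shared text before the variation
variation = make_app([("look to", ["A"]), ("compare", ["B"])])
variation.tail = " the individuals of the same variety"  # shared text after it
paragraph.append(variation)

tei_fragment = ET.tostring(paragraph, encoding="unicode")
print(tei_fragment)
```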

## TRAViz
<img src="images/traviz_screenshot.png"/>
TRAViz (http://www.traviz.vizcovery.org/index.html) is a JavaScript library that generates visualizations for Text Variant Graphs that show the variations among different editions of texts. TRAViz supports the collation task by providing methods to align various editions of a text, visualize the alignment, improve the readability of Text Variant Graphs compared to other approaches, and interact with the graph to discover how individual editions disseminate (paraphrased from the TRAViz splash page). The image of the Darwin texts, above, is a static screen shot, but the TRAViz interface responds to mouse events to provide additional information about particular moments in the visualization. License: FAL (http://www.vizcovery.org/fal.html).

## CollateX
<img src="images/collatex_screenshot.png"/>
CollateX is available in Java (http://collatex.net) and Python (https://pypi.python.org/pypi/collatex) versions. The Python version, which is the focus of this workshop, provides hooks for user modification of the Tokenization and Normalization stages of the Gothenburg collation model, and supports output as a plain text table, HTML, SVG variant graph, GraphML, generic XML, and TEI parallel segmentation XML. The materials for this workshop provide tutorial information on all aspects of using the Python version of CollateX.
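
As a rough illustration of what customizing the Tokenization and Normalization stages can involve, the sketch below prepares two witnesses by hand using only the Python standard library. The functions `tokenize`, `normalize`, and `build_witness` are our own hypothetical helpers, and the normalization rule (lowercase, ignore surrounding punctuation) is just one possible choice. The resulting structure, with a `t` (original text) and an `n` (normalized form) property on each token, follows the pretokenized JSON input format that CollateX is documented to accept. Passing it to the `collate` function of the `collatex` package is assumed rather than shown here.

```python
import json
import re
import string


def tokenize(text):
    """Split on whitespace, keeping trailing whitespace with each token."""
    return re.findall(r"\S+\s*", text)


def normalize(token):
    """Hypothetical rule: ignore case and surrounding punctuation."""
    return token.strip().strip(string.punctuation).lower()


def build_witness(siglum, text):
    """Return one witness as a dict with "t" and "n" values for each token."""
    tokens = [{"t": tok, "n": normalize(tok)} for tok in tokenize(text)]
    return {"id": siglum, "tokens": tokens}


collation_input = {
    "witnesses": [
        build_witness("A", "points which strikes us, is, that"),
        build_witness("B", "points which strikes us is, that"),
    ]
}
print(json.dumps(collation_input, indent=2))
```

With this normalization, `us,` in witness A and `us` in witness B share the same `n` value, so a collation that compares `n` values would treat them as agreeing, while the `t` values preserve the original punctuation for display.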