# Collating for real with CollateX. Plain texts

In this exercise, follow the instructions in this Notebook: read the Markdown cells and execute the Code cells (the ones with `In` and a number in square brackets on their left).

Not sure how to execute cells in a Notebook? Check the [Jupyter Notebook tutorial](../02_PrepareEnvironment/JupyterNotebook.ipynb).

### Delete the outputs

This notebook may already contain both the code and its outputs, that is, the results. We want to create the results afresh, so let's clear all the outputs. Go to the 'Kernel' menu, choose 'Restart & Clear Output', and confirm when Jupyter asks. After a few seconds a blue message saying 'Kernel ready' appears; it disappears so quickly that you might miss it, but don't worry: the Notebook is ready again either way.

Please note that we are clearing the results only because we want to run everything in the exercise. But if in the future you come back here, you don't need to delete the results before starting.

## Update CollateX

CollateX is already installed, but we want to make sure that we have the latest version. You don't need to do this every time, but make sure you do it regularly.

To upgrade from within the Jupyter Notebook, run:

In [None]:
!pip install --upgrade collatex

Or run it directly on the command line: `pip install --upgrade collatex` (without the exclamation mark at the beginning of the line).
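
If you are curious which version you ended up with, the following is a minimal sketch using `importlib.metadata` from the Python standard library (Python 3.8 or later). The helper name `installed_version` is our own and not part of CollateX.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


collatex_version = installed_version("collatex")
if collatex_version is None:
    print("collatex is not installed in this environment")
else:
    print("collatex version:", collatex_version)
```

Running this cell before and after the upgrade shows whether `pip` actually changed anything.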

## Run CollateX

Finally, we can use CollateX.

We need to tell Python that we will be needing the CollateX package. A package or library is a set of code files that together form a program. In Python, before using a library, you need to ask for it. Here is how you do it:

In [1]:
from collatex import *

Now we're ready to make a collation object. We do this with the slightly cryptic line of code:

    collation = Collation()
    
Here the lower case `collation` is an arbitrarily named variable that refers to a copy (officially it is called an *instance*) of the CollateX collation engine. We simply tell the collation library to create a new instance by saying `Collation()`.

In [2]:
collation = Collation()
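
If the word *instance* is new to you, the sketch below may help. It does not use CollateX at all: `TextBox` is a hypothetical class invented for this illustration, and it only shows that every call of a class name with parentheses yields a separate object with its own contents.

```python
class TextBox:
    """A hypothetical class, used only to illustrate what an instance is."""

    def __init__(self):
        # Each instance gets its own, initially empty, list of texts.
        self.texts = []

    def add_text(self, text):
        self.texts.append(text)


# Calling the class name with parentheses creates a new instance,
# in the same way that Collation() does.
first = TextBox()
second = TextBox()

first.add_text("The quick brown fox")

print(len(first.texts))   # number of texts stored in the first instance
print(len(second.texts))  # the second instance is not affected
```

In the same way, a second `Collation()` would start out empty, regardless of what was added to the first one.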

Now we add some witnesses. Each witness gets a letter or name that will identify it, and for each we add the literal text of the witness to the collation object.

In [3]:
collation.add_plain_witness( "A", "The quick brown fox jumped over the lazy dog.")
collation.add_plain_witness( "B", "The brown fox jumped over the dog." )
collation.add_plain_witness( "C", "The bad fox jumped over the lazy dog." )

And now we can let CollateX do its work of collating these witnesses and sit back for about 0.001 seconds. The result will be an alignment table, so we'll refer to the result with a variable named `alignment_table`.

In [4]:
alignment_table = collate(collation, layout='vertical', segmentation=False )

Well, that worked nicely it seems. But there's no printout, no visualization. That's okay, we can come up with a printout of the alignment table too:

In [5]:
print( alignment_table )

+--------+--------+--------+
|   A    |   B    |   C    |
+--------+--------+--------+
|  The   |  The   |  The   |
+--------+--------+--------+
| quick  |   -    |  bad   |
+--------+--------+--------+
| brown  | brown  |   -    |
+--------+--------+--------+
|  fox   |  fox   |  fox   |
+--------+--------+--------+
| jumped | jumped | jumped |
+--------+--------+--------+
|  over  |  over  |  over  |
+--------+--------+--------+
|  the   |  the   |  the   |
+--------+--------+--------+
|  lazy  |   -    |  lazy  |
+--------+--------+--------+
|  dog   |  dog   |  dog   |
+--------+--------+--------+
|   .    |   .    |   .    |
+--------+--------+--------+
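
To read the table, look for the rows in which the witnesses do not all agree. The sketch below does this in plain Python on rows copied by hand from the printout above; it is only an illustration of how to read the table and does not call CollateX. The variable names are our own.

```python
# Rows copied by hand from the alignment table printed above.
# Each tuple holds the readings of witnesses A, B and C; "-" marks a gap.
rows = [
    ("The", "The", "The"),
    ("quick", "-", "bad"),
    ("brown", "brown", "-"),
    ("fox", "fox", "fox"),
    ("jumped", "jumped", "jumped"),
    ("over", "over", "over"),
    ("the", "the", "the"),
    ("lazy", "-", "lazy"),
    ("dog", "dog", "dog"),
    (".", ".", "."),
]

# A row is a place of variation when its readings are not all identical.
variant_rows = [row for row in rows if len(set(row)) > 1]

for reading_a, reading_b, reading_c in variant_rows:
    print(f"A: {reading_a:<6} B: {reading_b:<6} C: {reading_c}")
```

Three rows remain: the ones containing `quick`, `brown` and `lazy`.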


CollateX can also collect the segments that run parallel and display them together. To do that, just delete the option `segmentation=False` as in the line below. We can now collate and print the output again.

In [6]:
alignment_table = collate(collation, layout='vertical' )
print( alignment_table )

+---------------------+---------------------+---------------------+
|          A          |          B          |          C          |
+---------------------+---------------------+---------------------+
|         The         |         The         |         The         |
+---------------------+---------------------+---------------------+
|        quick        |          -          |         bad         |
+---------------------+---------------------+---------------------+
|        brown        |        brown        |          -          |
+---------------------+---------------------+---------------------+
| fox jumped over the | fox jumped over the | fox jumped over the |
+---------------------+---------------------+---------------------+
|         lazy        |          -          |         lazy        |
+---------------------+---------------------+---------------------+
|         dog.        |         dog.        |         dog.        |
+---------------------+---------------------+---------------------+

### Jupyter Notebook cells order

You may have noticed that if you run the cells in the Notebook in order, they know about one another. For this reason, at the end of this tutorial we could produce different outputs using the information typed into the previous cells. When you open a notebook, remember to run the cells in order or to "run all cells" (from the menu Cell), otherwise you may get an error message.
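
The error message you are most likely to meet in that situation is a `NameError`. The sketch below provokes one on purpose by using a name that no cell has defined yet; the variable name is made up for the illustration.

```python
# Simulate running a later cell before the cell that defines its variable.
try:
    print(alignment_table_not_yet_defined)  # hypothetical name, never assigned
except NameError as error:
    message = str(error)
    print("Python reports:", message)
```

If you see a message like this for `collation` or `alignment_table`, run the earlier cells first and try again.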

## Recap and exercise

Before moving forward to see how to collate texts stored in files and to discover the various outputs that CollateX provides, let's recap what we've done and practise a bit.

We are using



1. First, create a new Markdown cell at the end of this Notebook (you could also create a new Notebook, but we'll save time by working in this one). Write in the new cell something like `My CollateX test`, so you know that these are your tests from that cell onwards. You can use the Markdown cells to document what is happening around them.

2. Then, create a Code cell and copy the code here below: this is all CollateX needs to collate some texts, the same instructions we gave it before but all together.

3. Now run the cell a first time and see the results.

4. Make changes and see how the output changes when you run the cell again. Change one thing at a time: this way, if you get an error message, it will be easier to debug the code. Try the following changes: 
    1. Change the text for each witness
    2. Set the segmentation option to True (you will see that it is the same as deleting it)
    3. Add a new witness
    4. It is also possible to change the sigil for each witness. The sigil is the abbreviation used for referring to a witness, here 'A', 'B', 'C'.



In [13]:
from collatex import *
collation = Collation()
collation.add_plain_witness( "A", "Some text here")
collation.add_plain_witness( "B", "Some text here as well" )
collation.add_plain_witness( "C", "Some text in the third witness as well" )
alignment_table = collate(collation, layout='vertical', segmentation=False)
print( alignment_table )

+------+------+---------+
|  A   |  B   |    C    |
+------+------+---------+
| Some | Some |   Some  |
+------+------+---------+
| text | text |   text  |
+------+------+---------+
| here | here |    in   |
+------+------+---------+
|  -   |  -   |   the   |
+------+------+---------+
|  -   |  -   |  third  |
+------+------+---------+
|  -   |  -   | witness |
+------+------+---------+
|  -   |  as  |    as   |
+------+------+---------+
|  -   | well |   well  |
+------+------+---------+
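
Once you start adding witnesses and changing sigils, it can be convenient to keep sigils and texts together in one dictionary. The sketch below is one possible way to do that: `add_witnesses` is a helper of our own, not a CollateX function, and witness "D" is invented for the illustration. The helper relies only on the `add_plain_witness` method used throughout this notebook.

```python
def add_witnesses(collation, witnesses):
    """Add every (sigil, text) pair in `witnesses` to `collation`.

    `collation` is expected to be a CollateX Collation object, that is,
    anything offering the add_plain_witness(sigil, text) method.
    """
    for sigil, text in witnesses.items():
        collation.add_plain_witness(sigil, text)
    return collation


# Sigils and texts kept together in one dictionary.
my_witnesses = {
    "A": "Some text here",
    "B": "Some text here as well",
    "C": "Some text in the third witness as well",
    "D": "Some more text here",  # a fourth witness, invented for illustration
}

# In the notebook you would then run:
#     collation = Collation()
#     add_witnesses(collation, my_witnesses)
#     print(collate(collation, layout='vertical', segmentation=False))
```

Changing a sigil or adding a witness then means editing a single line of the dictionary.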


# *DELETE OR MOVE FROM HERE*

Okay, that's all good and nice, but those are just tiny fragments; we want decent chunks of text to collate! Well, we can do that too, although it requires a little more work, specifically for reading text files in from the file system. If we didn't do it that way, we would have to key in all the characters of each witness, and that's a lot of unnecessary work if we already have those texts in files. The code below uses the `open` command to open each text file and assigns the contents to a variable with an appropriately chosen name.

The `encoding="utf-8"` bit is needed because you should always tell Python which encoding your data uses. This is probably the only place and time where you will use that encoding directive: when you open a (text) file.

In [None]:
collation = Collation()
witness_1859 = open( "../data/Darwin/txt/darwin1859_par1.txt", encoding='utf-8' ).read()
witness_1860 = open( "../data/Darwin/txt/darwin1860_par1.txt", encoding='utf-8' ).read()
witness_1861 = open( "../data/Darwin/txt/darwin1861_par1.txt", encoding='utf-8' ).read()
witness_1866 = open( "../data/Darwin/txt/darwin1866_par1.txt", encoding='utf-8' ).read()
witness_1869 = open( "../data/Darwin/txt/darwin1869_par1.txt", encoding='utf-8' ).read()
witness_1872 = open( "../data/Darwin/txt/darwin1872_par1.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "1859", witness_1859 )
collation.add_plain_witness( "1860", witness_1860 )
collation.add_plain_witness( "1861", witness_1861 )
collation.add_plain_witness( "1866", witness_1866 )
collation.add_plain_witness( "1869", witness_1869 )
collation.add_plain_witness( "1872", witness_1872 )
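
The cell above leaves it to Python to close the files. A slightly more careful variant, sketched here as one option among several, wraps `open` in a `with` block so that each file is closed as soon as it has been read. The helper `read_witness` and the file name `sample_witness.txt` are our own inventions; the sketch writes a throwaway file first so that it runs even without the Darwin data.

```python
import tempfile
from pathlib import Path


def read_witness(path):
    """Read a whole text file as UTF-8 and close it afterwards."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a throwaway file, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample_witness.txt"  # hypothetical file name
    sample_path.write_text("naïve café, 1859", encoding="utf-8")
    sample_text = read_witness(sample_path)

print(sample_text)
```

With the data used in this notebook, the call would look like `witness_1859 = read_witness("../data/Darwin/txt/darwin1859_par1.txt")`.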

Now let's check if these witnesses actually contain some text by printing a few of them.

In [None]:
print( witness_1859 )

In [None]:
print( witness_1860 )
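
For longer witnesses, printing the whole text can flood the Notebook. A possible alternative, sketched below, prints only the beginning of each text using `textwrap.shorten` from the standard library. The helper name `preview` and the sample sentence are our own; in the notebook you would pass `witness_1859` and the other witness variables instead.

```python
import textwrap


def preview(text, width=60):
    """Collapse whitespace and cut the text to at most `width` characters."""
    return textwrap.shorten(text, width=width, placeholder=" [...]")


# A made-up long text standing in for a real witness.
sample = "The quick brown fox jumped over the lazy dog. " * 5

print(preview(sample))
print(preview("Some text here"))  # short texts are returned unchanged
```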

And now let's collate those witnesses and let's put the result up as an HTML-formatted alignment table…

In [None]:
alignment_table = collate(collation, layout='vertical', output='html')

Hmm… that is still a little hard to read. Wouldn't it be nice if we got a hint where the actual differences are? Sure, try…

In [None]:
alignment_table = collate(collation, layout='vertical', output='html2')

And finally, we can also generate the variant graph for this collation…

In [None]:
graph = collate( collation, output="svg" )
