# Session 5b - Collate with CollateX (reading files)



In this exercise, follow the instructions in this notebook: read the Markdown cells and execute the Code cells (the ones with `In` and a number on their left).

Not sure how to execute cells in a Notebook? Check the [Jupyter Notebook tutorial](../02_PrepareEnvironment/JupyterNotebook.ipynb).

### Delete the outputs

In this notebook, you may already have outputs - the results of the exercises. We want to start from scratch, so let's delete the outputs!

- Go to the 'Kernel' menu.
- Click on 'Restart & Clear Output' and confirm when Jupyter asks.
- Wait a few seconds: a blue message saying 'Kernel ready' ('Noyau prêt') appears. If you don't see it, don't worry: it disappears so quickly that you may have missed it. Either way, the Notebook is ready again.




### Update CollateX

We want to make sure that we are using the latest version of CollateX. You don't need to do it every time, but it is best to do it regularly - so we are running this at the beginning of the notebook:


In [None]:
!pip install --upgrade collatex
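
To confirm which version is actually installed after the upgrade, one option is the standard library's `importlib.metadata`. This is only a sketch: the helper name `installed_version` is an illustrative assumption and not part of CollateX.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if CollateX is not installed in this kernel.
print(installed_version("collatex"))
```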


## 1. First exercise (Tolkien texts). Read from files

### Read text from files

Now we want to open the texts in "../data/Tolkien" and let Python read them. The Pryftan fragment is the first manuscript fragment of the story that will become *The Hobbit* (published in 1937); the other two witnesses are a typescript and a late edition (1995).

The code below shows how Python reads a file: it is not CollateX code, but the general Python way of doing things. You have already seen it in the [introduction to Python](https://github.com/automaticCollationLausanne2020/Materials/blob/master/session2/Session02_python_introduction.ipynb).

Each file is opened, read (using a specific character encoding) and stored in a variable ('pyftanFrgmt', etc.). Remember: the name of a variable cannot contain whitespace!

In [None]:
pyftanFrgmt = open( "../data/Tolkien/bilbo-pryftan-frgmt.txt", encoding='utf-8' ).read()
bladorthinTpscr = open( "../data/Tolkien/bilbo-bladorthin-tpscr.txt", encoding='utf-8' ).read()
edition1995 = open( "../data/Tolkien/bilbo-edition-1995.txt", encoding='utf-8' ).read()
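
The cell above leaves it to Python to close each file. A common alternative is a `with` block, which closes the file as soon as reading is done. The sketch below is self-contained: it first writes a small temporary file, so the sample text and the helper name `read_witness` are illustrative assumptions rather than part of the course data.

```python
import os
import tempfile


def read_witness(path):
    """Read a whole text file as UTF-8 and close it afterwards."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Create a throwaway file so the example runs anywhere (the sample text is made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-witness.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("A sample witness text.")
    sample_text = read_witness(sample_path)

print(sample_text)
```

With the course data, the same helper could be called as `read_witness("../data/Tolkien/bilbo-edition-1995.txt")`.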

Just to be sure that the text in the files has been stored, try to print one of them.

In [None]:
print(bladorthinTpscr)

Or another one

In [None]:
print(edition1995)
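
Printing a whole witness can fill the screen when the text is long. One possible quicker check is to print the length and only the first characters. In this minimal sketch, the helper name `preview` and the sample string are assumptions for illustration.

```python
def preview(text, limit=80):
    """Return the first `limit` characters of a text, with a marker if it was cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


sample = "word " * 100  # stand-in for a witness read from a file
print(len(sample))
print(preview(sample))
```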

### Import CollateX and create a collation object

Import the *collatex* Python library

In [None]:
from collatex import *

Create a collation object

In [None]:
collation = Collation()

### Add witnesses to the collation object

This is similar to what we did in the previous exercise, but instead of typing the text directly, we pass the variable containing the text read from the file.

In [None]:
collation.add_plain_witness( "witness fragment Pryftan", pyftanFrgmt )
collation.add_plain_witness( "witness typescript Bladorthin", bladorthinTpscr )
collation.add_plain_witness( "witness edition 1995", edition1995 )
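
Before adding witnesses, it can help to keep each label (siglum) next to its text in a dictionary and to check that no text is empty. The sketch below uses plain Python only; the helper `check_witnesses` and the sample strings are illustrative assumptions, not CollateX features.

```python
def check_witnesses(witnesses):
    """Return the sigla whose text is empty or only whitespace."""
    return [siglum for siglum, text in witnesses.items() if not text.strip()]


# Stand-in texts; in the notebook these would be the variables read from the files.
witnesses = {
    "A": "The first sample text.",
    "B": "The second sample text.",
    "C": "   ",
}

print(check_witnesses(witnesses))
```

Once the returned list is empty, each pair could be passed to `collation.add_plain_witness(siglum, text)` in a loop.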

### Collate and visualize the result

When you create the collation result, use the `output` option to specify the output you want. Here, it is set to 'html2'. We will see more about the output options in the next session!

In [None]:
collate(collation, output="html2")

# *Your turn!*

## 2. Second exercise (Woolf texts). Read from files

In the second exercise, we repeat the previous steps, now using the texts at "../data/Woolf/Lighthouse" and visualizing the output.

We will be using different editions of Virginia Woolf's *To the Lighthouse*:

    USA = New York: Harcourt, Brace & Company, 1927 (1st USA edition)
    UK = London: R & R Clark Limited, 1927 (1st UK edition)
    EM (EVERYMAN) = London: J. M. Dent & Sons LTD, 1938 (reprint 1952)

The facsimiles and transcriptions of the editions are available at http://woolfonline.com/. Please refer to the information in the data directory for the materials licence.

Check the code below and make sure you understand what each line does.

In [None]:
witness_USA = open( "../data/Woolf/Lighthouse/LighthouseUSA.txt", encoding='utf-8' ).read()
witness_UK = open( "../data/Woolf/Lighthouse/LighthouseUK.txt", encoding='utf-8' ).read()
witness_EM = open( "../data/Woolf/Lighthouse/LighthouseEM.txt", encoding='utf-8' ).read()

from collatex import *
collation = Collation()
collation.add_plain_witness( "USA", witness_USA )
collation.add_plain_witness( "UK", witness_UK )
collation.add_plain_witness( "EM", witness_EM )
collate(collation, output='html2')

# *Your turn!*

## 3. Still have time? Third exercise (Darwin texts). Read from files

Now you know how to collate texts stored in files. Try with the other materials inside the data directory: the first paragraph of Darwin's *On the Origin of Species*.

- Create a new code cell below, using the **+** button.
- Read the files.
- Check that they have been read by printing one of them.
- Import CollateX, create a collation object, add the witnesses, collate and visualize the result.
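
The steps above start with reading several files. When a directory holds many witnesses, one possible shortcut is to read every `.txt` file in it and use the file name (without extension) as the label. This is a sketch under assumptions: the helper name `read_witnesses` is hypothetical, and the demonstration uses a temporary directory with made-up files rather than the course data.

```python
import tempfile
from pathlib import Path


def read_witnesses(directory):
    """Map each .txt file name (without extension) to its UTF-8 content."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


# Build a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "first.txt").write_text("Sample text one.", encoding="utf-8")
    Path(tmp_dir, "second.txt").write_text("Sample text two.", encoding="utf-8")
    demo = read_witnesses(tmp_dir)

print(sorted(demo))
```

The resulting dictionary pairs each label with its text, so the witnesses can then be added to a collation object one by one, as in the exercises above.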